Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PartialEq for slices #26884

Merged
merged 1 commit into from
Jul 9, 2015
Merged

Improve PartialEq for slices #26884

merged 1 commit into from
Jul 9, 2015

Conversation

dotdash
Copy link
Contributor

@dotdash dotdash commented Jul 8, 2015

Exploiting the fact that getting the length of the slices is known, we
can use a counted loop instead of iterators, which means that we only
need a single counter, instead of having to increment and check one
pointer for each iterator.

Benchmarks comparing vectors with 100,000 elements:

Before:

running 8 tests
test eq1_u8  ... bench:      66,757 ns/iter (+/- 113)
test eq2_u16 ... bench:     111,267 ns/iter (+/- 149)
test eq3_u32 ... bench:     126,282 ns/iter (+/- 111)
test eq4_u64 ... bench:     126,418 ns/iter (+/- 155)
test ne1_u8  ... bench:      88,990 ns/iter (+/- 161)
test ne2_u16 ... bench:      89,126 ns/iter (+/- 265)
test ne3_u32 ... bench:      96,901 ns/iter (+/- 92)
test ne4_u64 ... bench:      96,750 ns/iter (+/- 137)

After:

running 8 tests
test eq1_u8  ... bench:      46,413 ns/iter (+/- 521)
test eq2_u16 ... bench:      46,500 ns/iter (+/- 74)
test eq3_u32 ... bench:      50,059 ns/iter (+/- 92)
test eq4_u64 ... bench:      54,001 ns/iter (+/- 92)
test ne1_u8  ... bench:      47,595 ns/iter (+/- 53)
test ne2_u16 ... bench:      47,521 ns/iter (+/- 59)
test ne3_u32 ... bench:      44,889 ns/iter (+/- 74)
test ne4_u64 ... bench:      47,775 ns/iter (+/- 68)

Exploiting the fact that getting the length of the slices is known, we
can use a counted loop instead of iterators, which means that we only
need a single counter, instead of having to increment and check one
pointer for each iterator.

Benchmarks comparing vectors with 100,000 elements:

Before:

```
running 8 tests
test eq1_u8  ... bench:      66,757 ns/iter (+/- 113)
test eq2_u16 ... bench:     111,267 ns/iter (+/- 149)
test eq3_u32 ... bench:     126,282 ns/iter (+/- 111)
test eq4_u64 ... bench:     126,418 ns/iter (+/- 155)
test ne1_u8  ... bench:      88,990 ns/iter (+/- 161)
test ne2_u16 ... bench:      89,126 ns/iter (+/- 265)
test ne3_u32 ... bench:      96,901 ns/iter (+/- 92)
test ne4_u64 ... bench:      96,750 ns/iter (+/- 137)
```

After:

```
running 8 tests
test eq1_u8  ... bench:      46,413 ns/iter (+/- 521)
test eq2_u16 ... bench:      46,500 ns/iter (+/- 74)
test eq3_u32 ... bench:      50,059 ns/iter (+/- 92)
test eq4_u64 ... bench:      54,001 ns/iter (+/- 92)
test ne1_u8  ... bench:      47,595 ns/iter (+/- 53)
test ne2_u16 ... bench:      47,521 ns/iter (+/- 59)
test ne3_u32 ... bench:      44,889 ns/iter (+/- 74)
test ne4_u64 ... bench:      47,775 ns/iter (+/- 68)
```
@rust-highfive
Copy link
Collaborator

r? @pcwalton

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Copy link
Member

@bors: r+ 9f4d5b4

Nice wins!

@bluss bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Jul 8, 2015
bors added a commit that referenced this pull request Jul 9, 2015
Exploiting the fact that getting the length of the slices is known, we
can use a counted loop instead of iterators, which means that we only
need a single counter, instead of having to increment and check one
pointer for each iterator.

Benchmarks comparing vectors with 100,000 elements:

Before:

```
running 8 tests
test eq1_u8  ... bench:      66,757 ns/iter (+/- 113)
test eq2_u16 ... bench:     111,267 ns/iter (+/- 149)
test eq3_u32 ... bench:     126,282 ns/iter (+/- 111)
test eq4_u64 ... bench:     126,418 ns/iter (+/- 155)
test ne1_u8  ... bench:      88,990 ns/iter (+/- 161)
test ne2_u16 ... bench:      89,126 ns/iter (+/- 265)
test ne3_u32 ... bench:      96,901 ns/iter (+/- 92)
test ne4_u64 ... bench:      96,750 ns/iter (+/- 137)
```

After:

```
running 8 tests
test eq1_u8  ... bench:      46,413 ns/iter (+/- 521)
test eq2_u16 ... bench:      46,500 ns/iter (+/- 74)
test eq3_u32 ... bench:      50,059 ns/iter (+/- 92)
test eq4_u64 ... bench:      54,001 ns/iter (+/- 92)
test ne1_u8  ... bench:      47,595 ns/iter (+/- 53)
test ne2_u16 ... bench:      47,521 ns/iter (+/- 59)
test ne3_u32 ... bench:      44,889 ns/iter (+/- 74)
test ne4_u64 ... bench:      47,775 ns/iter (+/- 68)
```
@bors
Copy link
Contributor

bors commented Jul 9, 2015

⌛ Testing commit 9f4d5b4 with merge 66b9277...

@bors bors merged commit 9f4d5b4 into rust-lang:master Jul 9, 2015
@lilianmoraru
Copy link

That's interesting. Now I am just a little bit confused.
I thought that using iterators is faster in Rust than using for loops because of bounds checking.
Long time ago I saw a benchmark style test that was initially using a for loop and had pretty bad performance compared to C and C++, then somebody else switched it to iterators and the performance was comparable...

@dotdash
Copy link
Contributor Author

dotdash commented Jul 14, 2015

@lilianmoraru In this case, the bounds checks get eliminated by the optimizer, because it can easily prove that it would never trigger, so there's no advantage for iterators here.

But using the zipped iterators, you get two pointers that both need to be incremented and checked, while the counted loop needs to increment and check just a single integer. Different problem ;-)

@dotdash dotdash deleted the fast branch July 27, 2015 08:49
@ranma42 ranma42 mentioned this pull request Sep 16, 2015
ranma42 added a commit to ranma42/rust that referenced this pull request Sep 16, 2015
Reusing the same idea as in rust-lang#26884, we can exploit the fact that the
length of slices is known, hence we can use a counted loop instead of
iterators, which means that we only need a single counter, instead of
having to increment and check one pointer for each iterator.

Using the generic implementation of the boolean comparison operators
(`lt`, `le`, `gt`, `ge`) provides further speedup for simple
types. This happens because the loop scans elements checking for
equality and dispatches to element comparison or length comparison
depending on the result of the prefix comparison.

```
test u8_cmp          ... bench:      14,043 ns/iter (+/- 1,732)
test u8_lt           ... bench:      16,156 ns/iter (+/- 1,864)
test u8_partial_cmp  ... bench:      16,250 ns/iter (+/- 2,608)
test u16_cmp         ... bench:      15,764 ns/iter (+/- 1,420)
test u16_lt          ... bench:      19,833 ns/iter (+/- 2,826)
test u16_partial_cmp ... bench:      19,811 ns/iter (+/- 2,240)
test u32_cmp         ... bench:      15,792 ns/iter (+/- 3,409)
test u32_lt          ... bench:      18,577 ns/iter (+/- 2,075)
test u32_partial_cmp ... bench:      18,603 ns/iter (+/- 5,666)
test u64_cmp         ... bench:      16,337 ns/iter (+/- 2,511)
test u64_lt          ... bench:      18,074 ns/iter (+/- 7,914)
test u64_partial_cmp ... bench:      17,909 ns/iter (+/- 1,105)
```

```
test u8_cmp          ... bench:       6,511 ns/iter (+/- 982)
test u8_lt           ... bench:       6,671 ns/iter (+/- 919)
test u8_partial_cmp  ... bench:       7,118 ns/iter (+/- 1,623)
test u16_cmp         ... bench:       6,689 ns/iter (+/- 921)
test u16_lt          ... bench:       6,712 ns/iter (+/- 947)
test u16_partial_cmp ... bench:       6,725 ns/iter (+/- 780)
test u32_cmp         ... bench:       7,704 ns/iter (+/- 1,294)
test u32_lt          ... bench:       7,611 ns/iter (+/- 3,062)
test u32_partial_cmp ... bench:       7,640 ns/iter (+/- 1,149)
test u64_cmp         ... bench:       7,517 ns/iter (+/- 2,164)
test u64_lt          ... bench:       7,579 ns/iter (+/- 1,048)
test u64_partial_cmp ... bench:       7,629 ns/iter (+/- 1,195)
```
bors added a commit that referenced this pull request Sep 16, 2015
This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation.
Based on the approach used in #26884.
}
}

true
}
fn ne(&self, other: &[B]) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this has already been merged and looks fantastic but I, as a Rust n00b, had a question as to why this isn't just

fn ne(&self, other: &[B]) -> bool {
    !self.eq(other);
}

Is it a performance issue to avoid the duplication?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even just drop this and use the default implementation of ne() from the trait definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be interested to know this as well. Could it be because there is call to ne on line 1480, which may not be the same as !eq?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for PartialEq states: "Any manual implementation of ne must respect the rule that eq is a strict inverse of ne; that is, !(a == b) if and only if a != b."
Under this assumption, ne can always re-use the implementation from the trait definition. I would expect the compiler to do a good job optimising it as it can fold the negation either on the possible outcomes or on the ne usage. Unless some benchmarking suggests otherwise, I would assume that avoiding the duplication is preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we'll have to wait to hear back from @dotdash then :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly why I didn't change that, but probably I was just too lazy to come up with enough useful testcases (involving at least various calling- and inlining-scenarios) to make sure that relying on the simple negation implementation doesn't affect performance negatively. Benchmarking welcome!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can find some time this weekend to test it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants