Improve the slice iterator's searching methods #37972
Conversation
r? @sfackler (rust_highfive has picked a reviewer for you, use r? to override)
benchmarks: https://gist.github.com/bluss/3fed8a20667d525a5f8a4713442fe540 (updated)
        }
    }
-   false
+   !self.all(move |x| !f(x))
This is a breaking change in semantics for code like this:
impl Iterator for _ {
    fn all(...) -> bool where ... { !self.any(move |x| !f(x)) }
    // no `any` def’n.
}
Currently it would just work, but after this change it would become an infinite loop.
I noticed this because it is a common issue in Haskell.
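To make the hazard concrete, here is a small self-contained sketch of the pattern described above, using a toy trait with hypothetical names rather than std's `Iterator`: when the provided `any` is defined through `all`, a downstream type that overrides `all` through `any` (without also overriding `any`) makes the two defaults call each other forever.

trait Search {
    fn next_val(&mut self) -> Option<i32>;

    // The proposed-style default: `any` expressed through `all`.
    fn any_val(&mut self, f: &mut dyn FnMut(i32) -> bool) -> bool {
        !self.all_val(&mut |x| !f(x))
    }

    // Default `all`: a plain short-circuiting loop.
    fn all_val(&mut self, f: &mut dyn FnMut(i32) -> bool) -> bool {
        while let Some(x) = self.next_val() {
            if !f(x) {
                return false;
            }
        }
        true
    }
}

struct Countdown(i32);

impl Search for Countdown {
    fn next_val(&mut self) -> Option<i32> {
        if self.0 == 0 {
            None
        } else {
            self.0 -= 1;
            Some(self.0)
        }
    }

    // The downstream pattern from the comment above: `all` defined through
    // `any`, with no `any` definition of its own. Now `any_val` calls
    // `all_val`, which calls `any_val`, and so on without ever consuming an
    // element; in practice the program dies with a stack overflow.
    fn all_val(&mut self, f: &mut dyn FnMut(i32) -> bool) -> bool {
        !self.any_val(&mut |x| !f(x))
    }
}

fn main() {
    let mut it = Countdown(3);
    // With independent defaults this would print "true"; with the mutually
    // recursive pair above the call never returns.
    println!("{}", it.any_val(&mut |x| x == 1));
}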
Oh nice catch. Indeed, it ICEs with a massive recursive type.
I've removed that part from the PR entirely.
Well, it looks like there is something really broken according to the travis tests... (Oh... ZST offsetting is broken! Edit: And now fixed, of course.)
Force-pushed from 4c90af8 to 57429b6.
nom's benchmark improves with the PR (some kind of http test) (commit 2e2730cdb451a55)
cc @Geal Do you have more nom benchmarks? This PR is kind of relevant for nom, I hope?
@bluss indeed, it is relevant. I have other benchmarks here: https://github.com/Geal/nom_benchmarks
chomp benchmarks cc @m4rw3r (commit d838dd267611edc)
@bluss I updated the benchmarks on https://github.com/Geal/nom_benchmarks , they should work with nom master now
Ok, I tried to run the benchmarks I found (nom-benchmarks commit 5246454ce735d4fb8c): mp4 and nom-http.
Use an extension trait for the slice iterator's pointer manipulations.
Introduce a helper method .search_while() that generalizes internal iteration (Iterator's all, find, position, fold and so on). The compiler does not unroll loops with conditional exits; we can do this manually instead to improve the performance of, for example, Iterator::find and Iterator::position when used on the slice iterators. The unrolling is patterned on libstdc++'s implementation of std::find_if.
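As a rough sketch of what the unrolling buys (a standalone illustration with a made-up name, using indices rather than the PR's raw-pointer search_while): process the slice in blocks of four so the hot loop body is straight-line code, then finish the remaining 0-3 elements with a plain loop.

// Sketch of 4-way manual unrolling for a predicate search, in the spirit of
// the PR (not the actual libcore code). Returns the index of the first
// element matching `pred`, if any.
fn find_index<T, F>(slice: &[T], mut pred: F) -> Option<usize>
where
    F: FnMut(&T) -> bool,
{
    let mut i = 0;
    // Take blocks of 4 while at least 4 elements remain; the fixed-size block
    // gives the optimizer straight-line code between the conditional exits.
    while slice.len() - i >= 4 {
        if pred(&slice[i]) { return Some(i); }
        if pred(&slice[i + 1]) { return Some(i + 1); }
        if pred(&slice[i + 2]) { return Some(i + 2); }
        if pred(&slice[i + 3]) { return Some(i + 3); }
        i += 4;
    }
    // Handle the remaining 0-3 elements with a plain loop.
    while i < slice.len() {
        if pred(&slice[i]) { return Some(i); }
        i += 1;
    }
    None
}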
Force-pushed from 90fd57d to a54ddfb.
I started with the idea of general internal iteration, but now it's been simplified into something better. It's
Nice! Is this a general optimization though? I guess if the FnMut is large LLVM will just refuse to inline it.
Good question. There would be cases where this manual unrolling is not the right thing to do, but very often the search's predicate is simple. Hidden in commit 3's log it says “The unrolling is patterned on libstdc++'s implementation of std::find_if.”, so there is a precedent: that library unrolls by 4 explicitly as well.
It's probably worth testing the behavior when the fn is complex. It's hardly the common case for filters though.
I would prefer to find an existing workload for that rather than something synthetic. This explicit unroll should be removed again when llvm does this by itself. Until then we have a lot to win: it pains me to see byte-by-byte loops non-unrolled, and it has some significant wins for the parsing libraries.
I found another parsing library, gimli (commit ccc49fa5d4cd2) cc @philipc, and it has some improvements and regressions; I couldn't quite understand exactly why two of the cases regressed.
One of the main beneficiaries (
ptrdistance can be expensive for non-power-of-2 sizes. What about specializing the PR for the numerical types (char, ints, floats) and small (e.g. 2-element) arrays/tuples?
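For reference, a plausible shape for the ptrdistance helper being discussed (an assumption, not the PR's exact code); the division by size_of::<T>() is exactly what gets expensive when the element size is not a power of two:

use std::mem;

// Number of T elements between two raw pointers into the same slice.
// For non-power-of-2 element sizes this is a real division rather than a
// shift, which is what the comment above is worried about.
fn ptrdistance<T>(start: *const T, end: *const T) -> usize {
    // Zero-sized types would divide by zero and need a dedicated code path.
    assert!(mem::size_of::<T>() != 0, "ZSTs need separate handling");
    (end as usize - start as usize) / mem::size_of::<T>()
}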
I'm ok with that, if we can introduce that conditional without making it too messy. It could be as simple as using unrolling only for small elements ("size_of <=
Upstream llvm bug for not unrolling simple loops: https://llvm.org/bugs/show_bug.cgi?id=27360
clang does the same thing, no unrolling:

#include <vector>

std::size_t find_zero(std::vector<char>& v) {
    auto it = v.begin();
    for (; it != v.end(); ++it) {
        if (!*it) {
            break;
        }
    }
    return std::distance(v.begin(), it);
}

; Function Attrs: norecurse nounwind readonly uwtable
define i64 @_Z9find_zeroRSt6vectorIcSaIcEE(%"class.std::vector"* nocapture readonly dereferenceable(24) %v) #0 {
%1 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 0
%2 = load i8*, i8** %1, align 8, !tbaa !1
%3 = getelementptr inbounds %"class.std::vector", %"class.std::vector"* %v, i64 0, i32 0, i32 0, i32 1
%4 = load i8*, i8** %3, align 8, !tbaa !1
%5 = icmp eq i8* %2, %4
%6 = ptrtoint i8* %2 to i64
br i1 %5, label %._crit_edge, label %.lr.ph.preheader
.lr.ph.preheader: ; preds = %0
br label %.lr.ph
.lr.ph: ; preds = %.lr.ph.preheader, %9
%it.sroa.0.02 = phi i8* [ %10, %9 ], [ %2, %.lr.ph.preheader ]
%7 = load i8, i8* %it.sroa.0.02, align 1, !tbaa !5
%8 = icmp eq i8 %7, 0
br i1 %8, label %._crit_edge.loopexit, label %9
; <label>:9 ; preds = %.lr.ph
%10 = getelementptr inbounds i8, i8* %it.sroa.0.02, i64 1
%11 = icmp eq i8* %10, %4
br i1 %11, label %._crit_edge.loopexit, label %.lr.ph
._crit_edge.loopexit: ; preds = %9, %.lr.ph
%it.sroa.0.0.lcssa.ph = phi i8* [ %4, %9 ], [ %it.sroa.0.02, %.lr.ph ]
br label %._crit_edge
._crit_edge: ; preds = %._crit_edge.loopexit, %0
%it.sroa.0.0.lcssa = phi i8* [ %2, %0 ], [ %it.sroa.0.0.lcssa.ph, %._crit_edge.loopexit ]
%12 = ptrtoint i8* %it.sroa.0.0.lcssa to i64
%13 = sub i64 %12, %6
ret i64 %13
}
@bluss is LLVM unable to remove the bounds checking if indexes are used instead of raw pointers? Are we already tracking this in an issue (and/or an LLVM bug)?
@ranma42 Do you mean that it should compute the equivalent slice and use an indexed loop over that instead? In that case I've already experimented with that kind of solution (example) and it was slower. (Yes, bounds checks are removed in that formulation.) I don't completely understand the question about using indexing in the slice iterator. Indexing doesn't work as well with the iterator (start, end pointer representation) as with the slice (pointer, length representation).
I meant that it would be possible to use a start/end index pair (or even a slice itself) to track the not-yet-consumed part in the iterator. If LLVM were able to remove all of the bounds checking, I would expect this to have good performance (as the ptr representation can trivially be obtained with a linear transformation, which IIRC is often applied to loop variables).
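To spell out the alternative being proposed here (an assumed shape, for illustration only, not code from the PR): the unconsumed region would be tracked as a slice plus start/end indices, and LLVM would have to elide the bounds check in next for this to match the pointer-pair version.

// Index-based slice iterator sketch: LLVM must prove `start < slice.len()`
// from `start < end` and `end <= slice.len()` to remove the bounds check.
struct IndexIter<'a, T> {
    slice: &'a [T],
    start: usize,
    end: usize,
}

impl<'a, T> IndexIter<'a, T> {
    fn new(slice: &'a [T]) -> Self {
        IndexIter { slice, start: 0, end: slice.len() }
    }
}

impl<'a, T> Iterator for IndexIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        if self.start < self.end {
            // The indexing here carries the bounds check in question.
            let item = &self.slice[self.start];
            self.start += 1;
            Some(item)
        } else {
            None
        }
    }
}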
I don't think that's a natural way to express these algorithms in the slice iterator. It then needs some code to fix up the start, end pointer pair (update from the indices) when it exits the loop. I was quite pleased with the code in search_while because it is so clear that it is "obviously" correct (link to that line). Ok, don't laugh 😄... Yes, there's a macro, a closure call, and a method call there on each line. But the methods and macros are introduced to make it neat; unsafe code should be neat and easy to review.
@bluss For the gimli benchmarks, I'm definitely seeing that improvement for
Looks good!
ping @sfackler, you're the reviewer here, but you haven't commented. Should we move this review to someone else?
@bors r+
Relevant parallel issue that hasn't been resolved: a discussion of whether libc++ should adopt libstdc++'s unrolling for the C++ find. https://llvm.org/bugs/show_bug.cgi?id=19708 And look at that delicious Duff's device.
@bors: r=sfackler
📌 Commit a54ddfb has been approved by sfackler
Improve the slice iterator's searching methods

Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators.

- Introduce a few extension methods and functions for raw pointers that make the new code easy to express
- Introduce helper methods `search_while, rsearch_while` that generalize all the searching methods

LLVM doesn't unroll the loop in `.find()` by default (clang is the same), so performance benefits a lot from explicit unrolling here. An iterator method without conditional exits (like `.fold()`) does not need this, on the other hand.

One of the raw pointer extension methods is `fn post_inc(&mut self) -> Self`, which is the rustic equivalent of “`ptr++`”, and it is a nice way to express the raw pointer loop (see commit 3).

Specific development notes about `search_while`: I tried both computing an end pointer "rounded" to 4 and the `ptrdistance >= 4` loop condition; ptrdistance was better. I tried handling the last 0-3 elements unrolled or with a while loop; the loop was better.
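For illustration, here is a standalone sketch of the post_inc idea and the raw pointer loop it enables. The extension trait, the function name find_index_ptr, and the loop shape are assumptions for this example, not the PR's actual libcore code:

use std::mem;

trait PostInc: Copy {
    // Return the current pointer, then advance it by one element (the rustic `ptr++`).
    unsafe fn post_inc(&mut self) -> Self;
}

impl<T> PostInc for *const T {
    unsafe fn post_inc(&mut self) -> Self {
        let current = *self;
        *self = unsafe { self.offset(1) };
        current
    }
}

// A pointer-pair search in the style the commit message describes: return the
// index of the first element matching `pred`.
fn find_index_ptr<T>(slice: &[T], mut pred: impl FnMut(&T) -> bool) -> Option<usize> {
    // Zero-sized types need a dedicated code path (pointer offsets don't advance).
    assert!(mem::size_of::<T>() != 0, "ZSTs need separate handling");
    let start = slice.as_ptr();
    let end = unsafe { start.add(slice.len()) };
    let mut ptr = start;
    while ptr != end {
        // `post_inc` yields the current element's address and steps the cursor.
        let current = unsafe { ptr.post_inc() };
        if pred(unsafe { &*current }) {
            return Some((current as usize - start as usize) / mem::size_of::<T>());
        }
    }
    None
}

Expressing the step as a single post_inc call is what keeps each line of the unsafe loop short and auditable, which is the point made earlier about neat unsafe code.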
☀️ Test successful - status-appveyor, status-travis
@bluss that moment when I click on a PR that I find interesting, to see if I can improve it using SIMD, and I follow your link to an LLVM bug report that I also found interesting, in which surprisingly the last comment is from myself, quoting a SIMD implementation of the algorithm 🤣 I've tried to think if we can use … We would need to provide in std "generic predicates", like e.g. …
@gnzlbg The compiler making the decision about unrolling and vectorization is definitely preferable, since code size etc. decisions can be made according to settings and context. Don't you think that autovectorization would follow as well if the compiler just learns to unroll this loop by itself? By the way, maybe you can improve on the memchr fallback that's in libstd if you are interested.
@bluss I will look into the memchr fallback. I agree that it would be better if auto-vectorization would just work, but... why do you think that it will ever work? All the data I have suggests that it never will: after 30 years of GCC and 15 years of LLVM, inlining, which is required for auto-vectorization, is believed to be an unsolvable problem, and auto-vectorization optimizations themselves still fail to reliably optimize even the most trivial loops.
I think there are many examples where it works amazingly well.
I should have emphasized the word reliably.