Fix vectorized `ranges::find` with `unreachable_sentinel` to properly mask the beginning and handle unaligned pointers #4450
Fixes #4449.
For unsized `ranges::find`, `vector_algorithms.cpp` reads elements "before the beginning", although (hopefully) not in a way that annoys memory page protection or ASAN. Then "we mask out matches that don't belong to the range", pretending that they weren't there. This allows us to use vectorized loads for the entire unbounded loop, instead of starting with classic scalar code until we get to a nicely-aligned boundary.

However, we had a control flow bug. We started with: "load" a vector chunk, "mask" away matches before the beginning, "check" if we found anything (and return if so). Then we started our infinite loop: `for (;;) { load; check; advance; }`. But this repeated the initial load and check, without the mask! It was also unnecessary extra work.

A minimal correctness fix would be to cycle around the "advance" step: `load; mask; check; for (;;) { advance; load; check; }`. But we can do even better - the "check" steps are exactly identical between the first part and the infinite loop. So we can cycle the loop a bit more: `load; mask; for (;;) { check; advance; load; }` avoids having to repeat the "check". (The "load" steps are also exactly identical, but there's no easy way to fuse them, given that we want to mask only the first one.)

Also fixes #4454.
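To illustrate the rotated loop for #4449 (`load; mask; for (;;) { check; advance; load; }`), here is a minimal scalar sketch. It is not the code in `vector_algorithms.cpp`: the "vector load" is simulated by a loop over a hypothetical 8-element chunk, and the names `kChunk`, `load_and_compare`, `lowest_set_bit`, and `find_unsized_sketch` are invented for this example. It assumes the buffer length is a multiple of the chunk size and that the value occurs at or after `first`, mirroring the `unreachable_sentinel` precondition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical "vector width" in elements; stands in for a real SIMD register.
constexpr std::size_t kChunk = 8;

// Stand-in for an aligned vector load plus compare: bit i of the result is set
// when buf[chunk_start + i] == value.
inline std::uint32_t load_and_compare(const std::vector<int>& buf, std::size_t chunk_start, int value) {
    std::uint32_t bits = 0;
    for (std::size_t i = 0; i < kChunk; ++i) {
        if (buf[chunk_start + i] == value) {
            bits |= (1u << i);
        }
    }
    return bits;
}

// Index of the lowest set bit. Precondition: bits != 0.
inline unsigned lowest_set_bit(std::uint32_t bits) {
    unsigned n = 0;
    while ((bits & 1u) == 0) {
        bits >>= 1;
        ++n;
    }
    return n;
}

// Unbounded find over chunk-aligned loads.
// Preconditions (assumptions of this sketch): buf.size() is a multiple of
// kChunk, and value occurs at some index >= first.
inline std::size_t find_unsized_sketch(const std::vector<int>& buf, std::size_t first, int value) {
    assert(buf.size() % kChunk == 0);
    const std::size_t offset = first % kChunk;
    std::size_t chunk_start  = first - offset; // round down to a chunk boundary

    std::uint32_t bits = load_and_compare(buf, chunk_start, value); // load
    bits &= (0xFFFFFFFFu << offset);                                // mask (first chunk only)

    for (;;) {
        if (bits != 0) {                                            // check
            return chunk_start + lowest_set_bit(bits);
        }
        chunk_start += kChunk;                                      // advance
        bits = load_and_compare(buf, chunk_start, value);           // load
    }
}
```

With a match placed before `first` in the same chunk, the mask is what keeps that earlier element from being reported; a loop that re-loaded the first chunk without the mask would return it instead.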
Unlike the other vectorized algorithms, find-unsized uses aligned loads, so it requires that its N-byte elements are N-aligned. This is notoriously untrue on x86, where 8-byte elements can appear on a 4-aligned stack. Packed structs can also subvert this assumption.
We can simply test whether the pointer is properly aligned. I was able to fix this with the following control flow:
Note that (unlike some other vectorized algorithms) our AVX2 and SSE-n codepaths here are always-return, so chaining them with `else if` is fine. I thought this was less disruptive than increasing the level of control flow nesting.
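As a rough illustration of the alignment test and the `else if` chaining described above, the sketch below checks whether a pointer is aligned to the element size and only then selects a vector path. This is not the diff from this PR: `is_size_aligned`, `Path`, and `choose_path` are hypothetical names, and the CPU feature flags are passed in as plain booleans rather than queried from the real dispatch machinery.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p is aligned to elem_size bytes (elem_size is assumed nonzero).
inline bool is_size_aligned(const void* p, std::size_t elem_size) {
    return reinterpret_cast<std::uintptr_t>(p) % elem_size == 0;
}

// Hypothetical labels for the code paths; the real implementation returns
// the search result directly from each branch.
enum class Path { Avx2, Sse, Scalar };

// Each vector branch always returns, so a flat if / else if chain is enough
// and the scalar fallback sits after it without extra nesting.
inline Path choose_path(const void* first, std::size_t elem_size, bool has_avx2, bool has_sse) {
    const bool aligned = is_size_aligned(first, elem_size);
    if (aligned && has_avx2) {
        return Path::Avx2;
    } else if (aligned && has_sse) {
        return Path::Sse;
    }
    return Path::Scalar; // misaligned elements or no vector support
}
```

An 8-byte element sitting at a 4-aligned address, as can happen on the x86 stack, fails the check and falls through to the scalar path.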