Improve slice::binary_search_by #127007
Conversation
Force-pushed from 7aa251c to e4a10ae (Compare)

Force-pushed from e4a10ae to a06e95e (Compare)
I've run benches with
Force-pushed from 27237e5 to a545bb3 (Compare)

Force-pushed from a545bb3 to 2f5eec9 (Compare)
A different strategy bore fruit! It relies on the fact that, except for ZSTs, `slice.len()` is less than `isize::MAX`, so `(right + left) / 2` is a sound way to compute the midpoint without overflow. By treating the ZST case separately, we can "revert" 3eb5bee and win some instructions.

**ZST comparing to `Equal` special case**

Assembly, before:

```asm
mov rdx, rsi
xor eax, eax
test rsi, rsi
sete al
shr rdx
ret
```

After:

```asm
xor eax, eax
test rsi, rsi
sete al
xor edx, edx
ret
```

**`usize` on x86_64**

Assembly, before:

```asm
example::custom_binary_search_gt::hb56715903c41dc94:
mov eax, 1
test rsi, rsi
je .LBB0_1
mov rcx, rdx
mov rdx, rsi
mov rcx, qword ptr [rcx]
xor esi, esi
mov r8, rdx
.LBB0_3:
shr rdx
add rdx, rsi
mov r9, qword ptr [rdi + 8*rdx]
cmp r9, rcx
je .LBB0_4
lea r10, [rdx + 1]
cmp r9, rcx
cmova r8, rdx
cmovb rsi, r10
mov rdx, r8
sub rdx, rsi
ja .LBB0_3
mov rdx, rsi
ret
.LBB0_1:
xor edx, edx
ret
.LBB0_4:
xor eax, eax
ret
```

After:

```asm
example::custom_binary_search_1::hc0743890bf3b6f85:
mov rcx, qword ptr [rdx]
xor eax, eax
xor edx, edx
.LBB1_1:
cmp rsi, rdx
jbe .LBB1_2
lea r8, [rsi + rdx]
shr r8
lea r9, [r8 + 1]
cmp qword ptr [rdi + 8*r8], rcx
cmovb rdx, r9
cmova rsi, r8
jne .LBB1_1
mov rdx, r8
ret
.LBB1_2:
mov eax, 1
ret
```

**Benches**

Before

After
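To make the overflow argument concrete, here is a minimal sketch of a midpoint-based binary search loop (my own illustration with hypothetical names, not the actual std code), assuming a non-ZST element type so that `len <= isize::MAX` and `left + right` cannot wrap a `usize`:

```rust
use std::cmp::Ordering;

/// Sketch of a binary search over a non-ZST slice. Since non-ZST
/// allocations are capped at `isize::MAX` bytes, `left + right`
/// is at most `2 * isize::MAX`, which fits in a `usize`.
fn binary_search_sketch<T: Ord>(slice: &[T], target: &T) -> Result<usize, usize> {
    let mut left = 0usize;
    let mut right = slice.len();
    while left < right {
        // No overflow: left <= right <= len <= isize::MAX.
        let mid = (left + right) / 2;
        match slice[mid].cmp(target) {
            Ordering::Less => left = mid + 1,
            Ordering::Greater => right = mid,
            Ordering::Equal => return Ok(mid),
        }
    }
    Err(left)
}

fn main() {
    let v = [1, 3, 5, 7, 9];
    assert_eq!(binary_search_sketch(&v, &5), Ok(2));
    assert_eq!(binary_search_sketch(&v, &4), Err(2));
    println!("ok");
}
```

The ZST special case exists precisely to preserve this bound: it is the one situation where `len` may exceed `isize::MAX`.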
@bors try @rust-timer queue
**Improve slice::binary_search_by**

This PR aims to improve the performance of `std::slice::binary_search`.

**EDIT: The proposed implementation changed, so the rest of this comment is outdated. See rust-lang#127007 (comment) for an up-to-date presentation of the PR.**

It reduces the total instruction count for the `u32` monomorphization, but maybe more remarkably, removes 2 of the 12 instructions of the main loop (on x86). It changes `test_binary_search_implementation_details()`, so it may warrant a crater run.

I will document it much more if this is shown to be interesting on benchmarks. Could we start with a timer run first?

**Before the PR**

```asm
mov eax, 1
test rsi, rsi
je .LBB0_1
mov rcx, rdx
mov rdx, rsi
mov ecx, dword ptr [rcx]
xor esi, esi
mov r8, rdx
.LBB0_3:
shr rdx
add rdx, rsi
mov r9d, dword ptr [rdi + 4*rdx]
cmp r9d, ecx
je .LBB0_4
lea r10, [rdx + 1]
cmp r9d, ecx
cmova r8, rdx
cmovb rsi, r10
mov rdx, r8
sub rdx, rsi
ja .LBB0_3
mov rdx, rsi
ret
.LBB0_1:
xor edx, edx
ret
.LBB0_4:
xor eax, eax
ret
```

**After the PR**

```asm
mov ecx, dword ptr [rdx]
xor eax, eax
xor edx, edx
.LBB1_1:
cmp rsi, 1
jbe .LBB1_2
mov r9, rsi
shr r9
lea r8, [r9 + rdx]
sub rsi, r9
cmp dword ptr [rdi + 4*r8], ecx
cmovb rdx, r8
cmova rsi, r9
jne .LBB1_1
mov rdx, r8
ret
.LBB1_2:
test rsi, rsi
je .LBB1_3
xor eax, eax
cmp dword ptr [rdi + 4*rdx], ecx
setne al
adc rdx, 0
ret
.LBB1_3:
mov eax, 1
ret
```
☀️ Try build successful - checks-actions
Finished benchmarking commit (3dc1b1e): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

**Instruction count**

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

**Max RSS (memory usage)**

Results (primary 0.0%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

**Cycles**

Results (primary -2.6%, secondary 2.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

**Binary size**

Results (primary -0.0%, secondary -0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 694.857s -> 695.931s (0.15%)
Hi, I discovered this PR this morning; yesterday there was a post on Zulip with some ideas on how to improve the std binary search. In case it is worth something or useful, here is the post on Zulip: Proposal for Binary Search Algorithm Enhancing. I am not sure whether it is desirable, better, or worse, but if the algorithm is under review, maybe it is helpful.
Force-pushed from 2f5eec9 to 636412c (Compare)
Except for ZSTs, `[T].len()` is less than `isize::MAX`. By treating the ZST case separately, we can "revert" 3eb5bee, remove the `size` local variable and win two instructions.
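For context, a small illustration (my own, not the PR's code) of why ZSTs are the one case that needs separate treatment: only a slice of a zero-sized type can be longer than `isize::MAX` elements, and `binary_search` still has to work on such slices.

```rust
fn main() {
    // Non-ZST allocations are capped at isize::MAX bytes, so for
    // non-ZST slices, left + right <= 2 * len cannot wrap a usize.
    let max_non_zst_len = isize::MAX as usize;
    assert!(max_non_zst_len.checked_add(max_non_zst_len).is_some());

    // A ZST slice is not subject to that cap; its length may exceed
    // isize::MAX, so the midpoint computation must not rely on it.
    let zsts = [(); 7];
    // Every `()` compares Equal, so the search succeeds at some index.
    assert!(zsts.binary_search(&()).is_ok());
    println!("ok");
}
```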
Force-pushed from 636412c to 4093a4d (Compare)
All benchmarks are good and I added comments, so this is ready for review.
```rust
// SAFETY: We have that left < right, so
// 0 <= left <= mid < right <= self.len()
// and the indexing is in-bounds.
```
Within the safety comment, there could be an explicit mention of where the `mid` relations come from. As a reader, it's easier to evaluate the logic in the comment without needing to consult the line of code above at the same time.
```diff
-// SAFETY: We have that left < right, so
-// 0 <= left <= mid < right <= self.len()
-// and the indexing is in-bounds.
+// SAFETY: The while condition ensures left < right,
+// so `left <= (left + right) / 2 < right`. Thus,
+// 0 <= left <= mid < right <= self.len()
+// and the indexing is in-bounds.
```
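To show the suggested comment in context, here is a sketch of the loop shape it would annotate (a hypothetical standalone lower-bound version, not the actual std source):

```rust
use std::cmp::Ordering;

// Hypothetical loop illustrating the safety invariant discussed above:
// returns the first index at which `target` could be inserted while
// keeping `slice` sorted.
fn lower_bound<T: Ord>(slice: &[T], target: &T) -> usize {
    let mut left = 0usize;
    let mut right = slice.len();
    while left < right {
        let mid = (left + right) / 2;
        // SAFETY: The while condition ensures left < right,
        // so `left <= (left + right) / 2 < right`. Thus,
        // 0 <= left <= mid < right <= slice.len()
        // and the indexing is in-bounds.
        let elem = unsafe { slice.get_unchecked(mid) };
        if elem.cmp(target) == Ordering::Less {
            left = mid + 1;
        } else {
            right = mid;
        }
    }
    left
}

fn main() {
    let v = [10, 20, 20, 30];
    assert_eq!(lower_bound(&v, &20), 1);
    assert_eq!(lower_bound(&v, &25), 3);
    println!("ok");
}
```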
This binary search implementation by @orlp might also be interesting to look at: https://orlp.net/blog/bitwise-binary-search/
@Miksel12 That article is mostly for educational purposes. I believe this to be the fastest pattern possible: https://mhdm.dev/posts/sb_lower_bound/#update (if you can get the compiler to generate
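For reference, the branchless lower-bound pattern that post describes looks roughly like the following sketch (names are mine; whether the comparison actually compiles to a conditional move rather than a branch depends on the compiler):

```rust
// Branchless-style lower bound: each iteration halves the search
// window and conditionally advances the base, with no early exit.
fn branchless_lower_bound<T: Ord>(slice: &[T], target: &T) -> usize {
    let mut base = 0usize;
    let mut size = slice.len();
    while size > 0 {
        let half = size / 2;
        let mid = base + half;
        // Intended to become a conditional move, not a branch.
        if slice[mid] < *target {
            base = mid + 1;
            size -= half + 1;
        } else {
            size = half;
        }
    }
    base
}

fn main() {
    let v = [1, 3, 3, 5];
    assert_eq!(branchless_lower_bound(&v, &3), 1);
    assert_eq!(branchless_lower_bound(&v, &4), 3);
    println!("ok");
}
```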
The massive difference is that this version does not have an early return for the equality case. Removing this early return from our implementation would also bring the instruction count down, but I don't think this is what we want.
@krtab I don't think the early return is valuable. It only comes into play if you're exceptionally lucky, and then it only saves O(log n) iterations in the best case. It adds an extra branch to every iteration every time you're not exceptionally lucky. To be more precise, it is expected that the number of comparisons you save for random data is

For what it's worth, in my experience you almost always want
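The tradeoff under discussion is visible in std's own stable APIs: `slice::binary_search` may exit early and return any index of a matching element, while `slice::partition_point` never exits early and always returns a boundary index.

```rust
fn main() {
    let v = [1, 2, 2, 2, 3];
    // binary_search may return any index whose element equals 2;
    // which one is an implementation detail.
    let idx = v.binary_search(&2).unwrap();
    assert_eq!(v[idx], 2);
    // partition_point gives the lower bound: the first index where
    // the predicate turns false.
    assert_eq!(v.partition_point(|&x| x < 2), 1);
    // And with <=, the upper bound.
    assert_eq!(v.partition_point(|&x| x <= 2), 4);
    println!("ok");
}
```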
In #128254 I actually revert to the old algorithm, which turns out to be much faster once you ensure that LLVM doesn't turn CMOV into branches.
☔ The latest upstream changes (presumably #128254) made this pull request unmergeable. Please resolve the merge conflicts.