
performance regression of binary_search #115271

Closed
kekeimiku opened this issue Aug 27, 2023 · 7 comments · Fixed by #128254
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@kekeimiku

kekeimiku commented Aug 27, 2023

This affects a real project of mine: `binary_search` on a `[usize]` slice shows a huge performance regression.

After investigation, it is caused by the new `binary_search` introduced by PR #74024.

Why is the new `binary_search` not `binary_search_unstable`?

MacBook Pro (M2)

test new_binary_search_l1            ... bench:          38 ns/iter (+/- 0)
test new_binary_search_l1_with_dups  ... bench:          22 ns/iter (+/- 0)
test new_binary_search_l1_worst_case ... bench:           7 ns/iter (+/- 1)
test new_binary_search_l2            ... bench:          56 ns/iter (+/- 6)
test new_binary_search_l2_with_dups  ... bench:          35 ns/iter (+/- 0)
test new_binary_search_l2_worst_case ... bench:          11 ns/iter (+/- 0)
test new_binary_search_l3            ... bench:         104 ns/iter (+/- 1)
test new_binary_search_l3_with_dups  ... bench:          85 ns/iter (+/- 1)
test new_binary_search_l3_worst_case ... bench:          19 ns/iter (+/- 0)
test old_binary_search_l1            ... bench:           5 ns/iter (+/- 0)
test old_binary_search_l1_with_dups  ... bench:           5 ns/iter (+/- 0)
test old_binary_search_l1_worst_case ... bench:           4 ns/iter (+/- 0)
test old_binary_search_l2            ... bench:           8 ns/iter (+/- 0)
test old_binary_search_l2_with_dups  ... bench:           8 ns/iter (+/- 0)
test old_binary_search_l2_worst_case ... bench:           7 ns/iter (+/- 0)
test old_binary_search_l3            ... bench:          24 ns/iter (+/- 0)
test old_binary_search_l3_with_dups  ... bench:          25 ns/iter (+/- 6)
test old_binary_search_l3_worst_case ... bench:          12 ns/iter (+/- 0)
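Regarding the naming question above: std's `binary_search` documents that when the slice contains several matching elements, any one of the matching indices may be returned (and which one is returned changed with #74024). A minimal illustration of that contract, using only stable std APIs:

```rust
fn main() {
    let v = [1, 4, 4, 4, 9];
    // Any matching index may be returned; only membership is guaranteed.
    let idx = v.binary_search(&4).unwrap();
    assert_eq!(v[idx], 4);
    // A missing value yields the insertion point via Err.
    assert_eq!(v.binary_search(&5), Err(4));
}
```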
lib.rs

#![feature(core_intrinsics)]

use std::cmp::Ord;
use std::cmp::Ordering::{self, Equal, Greater, Less};

pub fn old_binary_search<T>(s: &[T], x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    old_binary_search_by(s, |p| p.cmp(x))
}

#[inline]
pub fn old_binary_search_by<'a, T, F>(s: &'a [T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // mid is always in [0, size), that means mid is >= 0 and < size.
        // mid >= 0: by definition
        // mid < size: mid = size / 2 + size / 4 + size / 8 ...
        let cmp = f(unsafe { s.get_unchecked(mid) });
        base = if cmp == Greater { base } else { mid };
        size -= half;
    }
    // base is always in [0, size) because base <= mid.
    let cmp = f(unsafe { s.get_unchecked(base) });
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}

pub fn new_binary_search<T>(s: &[T], x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    new_binary_search_by(s, |p| p.cmp(x))
}

#[inline]
pub fn new_binary_search_by<'a, T, F>(s: &'a [T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    // INVARIANTS:
    // - 0 <= left <= left + size = right <= self.len()
    // - f returns Less for everything in self[..left]
    // - f returns Greater for everything in self[right..]
    let mut size = s.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;

        // SAFETY: the while condition means `size` is strictly positive, so
        // `size/2 < size`. Thus `left + size/2 < left + size`, which
        // coupled with the `left + size <= self.len()` invariant means
        // we have `left + size/2 < self.len()`, and this is in-bounds.
        let cmp = f(unsafe { s.get_unchecked(mid) });

        // The reason why we use if/else control flow rather than match
        // is because match reorders comparison operations, which is perf sensitive.
        // This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            // SAFETY: same as the `get_unchecked` above
            unsafe { core::intrinsics::assume(mid < s.len()) };
            return Ok(mid);
        }

        size = right - left;
    }

    // SAFETY: directly true from the overall invariant.
    // Note that this is `<=`, unlike the assume in the `Ok` path.
    unsafe { core::intrinsics::assume(left <= s.len()) };
    Err(left)
}

bench.rs

#![feature(test)]
extern crate test;

use test::black_box;
use test::Bencher;

use binary_search_bench::*;

enum Cache {
    L1,
    L2,
    L3,
}

fn old_bench_binary_search<F>(b: &mut Bencher, cache: Cache, mapper: F)
where
    F: Fn(usize) -> usize,
{
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let v = (0..size).map(&mapper).collect::<Vec<_>>();
    let mut r = 0usize;
    b.iter(move || {
        // LCG constants from https://en.wikipedia.org/wiki/Numerical_Recipes.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        // Lookup the whole range to get 50% hits and 50% misses.
        let i = mapper(r % size);
        black_box(old_binary_search(&v, &i).is_ok());
    })
}

fn old_bench_binary_search_worst_case(b: &mut Bencher, cache: Cache) {
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let mut v = vec![0; size];
    let i = 1;
    v[size - 1] = i;
    b.iter(move || {
        black_box(old_binary_search(&v, &i).is_ok());
    })
}

#[bench]
fn old_binary_search_l1(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L1, |i| i * 2);
}

#[bench]
fn old_binary_search_l2(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L2, |i| i * 2);
}

#[bench]
fn old_binary_search_l3(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L3, |i| i * 2);
}

#[bench]
fn old_binary_search_l1_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L1, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l2_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L2, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l3_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L3, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l1_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L1);
}

#[bench]
fn old_binary_search_l2_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L2);
}

#[bench]
fn old_binary_search_l3_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L3);
}

fn new_bench_binary_search<F>(b: &mut Bencher, cache: Cache, mapper: F)
where
    F: Fn(usize) -> usize,
{
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let v = (0..size).map(&mapper).collect::<Vec<_>>();
    let mut r = 0usize;
    b.iter(move || {
        // LCG constants from https://en.wikipedia.org/wiki/Numerical_Recipes.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        // Lookup the whole range to get 50% hits and 50% misses.
        let i = mapper(r % size);
        black_box(new_binary_search(&v, &i).is_ok());
    })
}

fn new_bench_binary_search_worst_case(b: &mut Bencher, cache: Cache) {
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let mut v = vec![0; size];
    let i = 1;
    v[size - 1] = i;
    b.iter(move || {
        black_box(new_binary_search(&v, &i).is_ok());
    })
}

#[bench]
fn new_binary_search_l1(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L1, |i| i * 2);
}

#[bench]
fn new_binary_search_l2(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L2, |i| i * 2);
}

#[bench]
fn new_binary_search_l3(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L3, |i| i * 2);
}

#[bench]
fn new_binary_search_l1_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L1, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l2_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L2, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l3_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L3, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l1_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L1);
}

#[bench]
fn new_binary_search_l2_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L2);
}

#[bench]
fn new_binary_search_l3_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L3);
}

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 27, 2023
@kekeimiku kekeimiku changed the title Significant slowdown of binary_search performance regression of binary_search Aug 27, 2023
@the8472
Member

the8472 commented Aug 27, 2023

If your use case permits it, you can try `partition_point` instead, which doesn't check the equality case. Its limitation is that it doesn't distinguish the presence or absence of the element at the returned index.
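For illustration (an editor's sketch, not part of the original comment): `partition_point` returns the index of the first element for which the predicate is false, so presence of the target must be checked separately:

```rust
fn main() {
    let v = [1, 2, 4, 4, 8];
    // Index of the first element not less than 4, i.e. the leftmost
    // position where 4 could be inserted while keeping order.
    let idx = v.partition_point(|&x| x < 4);
    assert_eq!(idx, 2);
    // Unlike binary_search, presence must be checked explicitly:
    assert!(v.get(idx) == Some(&4));
}
```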

@saethlin saethlin added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Aug 28, 2023
@kekeimiku
Author

There seems to be a huge performance regression only on aarch64.

x86_64 i7-9750H:

(screenshot of benchmark results)

@okaneco
Contributor

okaneco commented Nov 24, 2023

I haven't run your benchmark yet, but #117722 was merged which might reclaim some performance.

Using old_binary_search_by from your lib.rs,
New BSearch vs. #117722 BSearch - https://rust.godbolt.org/z/qEfPPsY59
Old BSearch vs. #117722 BSearch - https://rust.godbolt.org/z/8qE13cnvW

@kekeimiku
Author

@okaneco Hi, thank you very much for your contribution.

Here are the new benchmarks (MacBook Pro M2).

It looks like #117722 is faster than new_binary_search, but still not as good as old_binary_search.

1.76.0-nightly (37b2813a7 2023-11-24)
cargo bench
test binary_search_117722_l1            ... bench:          18 ns/iter (+/- 1)
test binary_search_117722_l1_with_dups  ... bench:          10 ns/iter (+/- 0)
test binary_search_117722_l1_worst_case ... bench:           7 ns/iter (+/- 0)
test binary_search_117722_l2            ... bench:          27 ns/iter (+/- 1)
test binary_search_117722_l2_with_dups  ... bench:          18 ns/iter (+/- 0)
test binary_search_117722_l2_worst_case ... bench:          13 ns/iter (+/- 0)
test binary_search_117722_l3            ... bench:          60 ns/iter (+/- 0)
test binary_search_117722_l3_with_dups  ... bench:          43 ns/iter (+/- 0)
test binary_search_117722_l3_worst_case ... bench:          24 ns/iter (+/- 0)
test new_binary_search_l1               ... bench:          39 ns/iter (+/- 0)
test new_binary_search_l1_with_dups     ... bench:          22 ns/iter (+/- 0)
test new_binary_search_l1_worst_case    ... bench:           7 ns/iter (+/- 0)
test new_binary_search_l2               ... bench:          55 ns/iter (+/- 0)
test new_binary_search_l2_with_dups     ... bench:          36 ns/iter (+/- 0)
test new_binary_search_l2_worst_case    ... bench:          11 ns/iter (+/- 0)
test new_binary_search_l3               ... bench:         105 ns/iter (+/- 0)
test new_binary_search_l3_with_dups     ... bench:          85 ns/iter (+/- 2)
test new_binary_search_l3_worst_case    ... bench:          19 ns/iter (+/- 0)
test old_binary_search_l1               ... bench:           5 ns/iter (+/- 0)
test old_binary_search_l1_with_dups     ... bench:           5 ns/iter (+/- 0)
test old_binary_search_l1_worst_case    ... bench:           4 ns/iter (+/- 0)
test old_binary_search_l2               ... bench:           8 ns/iter (+/- 0)
test old_binary_search_l2_with_dups     ... bench:           8 ns/iter (+/- 0)
test old_binary_search_l2_worst_case    ... bench:           7 ns/iter (+/- 0)
test old_binary_search_l3               ... bench:          24 ns/iter (+/- 0)
test old_binary_search_l3_with_dups     ... bench:          24 ns/iter (+/- 2)
test old_binary_search_l3_worst_case    ... bench:          12 ns/iter (+/- 0)
lib.rs
#![feature(core_intrinsics)]

use std::cmp::Ord;
use std::cmp::Ordering::{self, Equal, Greater, Less};

pub fn old_binary_search<T>(s: &[T], x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    old_binary_search_by(s, |p| p.cmp(x))
}

#[inline(always)]
pub fn old_binary_search_by<'a, T, F>(s: &'a [T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // mid is always in [0, size), that means mid is >= 0 and < size.
        // mid >= 0: by definition
        // mid < size: mid = size / 2 + size / 4 + size / 8 ...
        let cmp = f(unsafe { s.get_unchecked(mid) });
        base = if cmp == Greater { base } else { mid };
        size -= half;
    }
    // base is always in [0, size) because base <= mid.
    let cmp = f(unsafe { s.get_unchecked(base) });
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}

pub fn new_binary_search<T>(s: &[T], x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    new_binary_search_by(s, |p| p.cmp(x))
}

#[inline(always)]
pub fn new_binary_search_by<'a, T, F>(s: &'a [T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    // INVARIANTS:
    // - 0 <= left <= left + size = right <= self.len()
    // - f returns Less for everything in self[..left]
    // - f returns Greater for everything in self[right..]
    let mut size = s.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;

        // SAFETY: the while condition means `size` is strictly positive, so
        // `size/2 < size`. Thus `left + size/2 < left + size`, which
        // coupled with the `left + size <= self.len()` invariant means
        // we have `left + size/2 < self.len()`, and this is in-bounds.
        let cmp = f(unsafe { s.get_unchecked(mid) });

        // The reason why we use if/else control flow rather than match
        // is because match reorders comparison operations, which is perf sensitive.
        // This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }

        size = right - left;
    }
    Err(left)
}

pub fn binary_search_117722<T>(s: &[T], x: &T) -> Result<usize, usize>
where
    T: Ord,
{
    binary_search_by_117722(s, |p| p.cmp(x))
}

#[inline(always)]
pub fn binary_search_by_117722<'a, F, T>(s: &'a [T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&'a T) -> Ordering,
{
    // INVARIANTS:
    // - 0 <= left <= left + size = right <= self.len()
    // - f returns Less for everything in self[..left]
    // - f returns Greater for everything in self[right..]
    let mut size = s.len();
    let mut left = 0;
    let mut right = size;
    while left < right {
        let mid = left + size / 2;

        // SAFETY: the while condition means `size` is strictly positive, so
        // `size/2 < size`. Thus `left + size/2 < left + size`, which
        // coupled with the `left + size <= self.len()` invariant means
        // we have `left + size/2 < self.len()`, and this is in-bounds.
        let cmp = f(unsafe { s.get_unchecked(mid) });

        // This control flow produces conditional moves, which results in
        // fewer branches and instructions than if/else or matching on
        // cmp::Ordering.
        // This is x86 asm for u8: https://rust.godbolt.org/z/698eYffTx.
        left = if cmp == Less { mid + 1 } else { left };
        right = if cmp == Greater { mid } else { right };
        if cmp == Equal {
            // SAFETY: same as the `get_unchecked` above
            unsafe { std::intrinsics::assume(mid < s.len()) };
            return Ok(mid);
        }

        size = right - left;
    }

    // SAFETY: directly true from the overall invariant.
    // Note that this is `<=`, unlike the assume in the `Ok` path.
    unsafe { std::intrinsics::assume(left <= s.len()) };
    Err(left)
}
bench.rs
#![feature(test)]
extern crate test;

use test::black_box;
use test::Bencher;

use binary_search_bench::*;

enum Cache {
    L1,
    L2,
    L3,
}

fn old_bench_binary_search<F>(b: &mut Bencher, cache: Cache, mapper: F)
where
    F: Fn(usize) -> usize,
{
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let v = (0..size).map(&mapper).collect::<Vec<_>>();
    let mut r = 0usize;
    b.iter(move || {
        // LCG constants from https://en.wikipedia.org/wiki/Numerical_Recipes.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        // Lookup the whole range to get 50% hits and 50% misses.
        let i = mapper(r % size);
        black_box(old_binary_search(&v, &i).is_ok());
    })
}

fn old_bench_binary_search_worst_case(b: &mut Bencher, cache: Cache) {
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let mut v = vec![0; size];
    let i = 1;
    v[size - 1] = i;
    b.iter(move || {
        black_box(old_binary_search(&v, &i).is_ok());
    })
}

#[bench]
fn old_binary_search_l1(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L1, |i| i * 2);
}

#[bench]
fn old_binary_search_l2(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L2, |i| i * 2);
}

#[bench]
fn old_binary_search_l3(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L3, |i| i * 2);
}

#[bench]
fn old_binary_search_l1_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L1, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l2_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L2, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l3_with_dups(b: &mut Bencher) {
    old_bench_binary_search(b, Cache::L3, |i| i / 16 * 16);
}

#[bench]
fn old_binary_search_l1_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L1);
}

#[bench]
fn old_binary_search_l2_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L2);
}

#[bench]
fn old_binary_search_l3_worst_case(b: &mut Bencher) {
    old_bench_binary_search_worst_case(b, Cache::L3);
}

fn new_bench_binary_search<F>(b: &mut Bencher, cache: Cache, mapper: F)
where
    F: Fn(usize) -> usize,
{
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let v = (0..size).map(&mapper).collect::<Vec<_>>();
    let mut r = 0usize;
    b.iter(move || {
        // LCG constants from https://en.wikipedia.org/wiki/Numerical_Recipes.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        // Lookup the whole range to get 50% hits and 50% misses.
        let i = mapper(r % size);
        black_box(new_binary_search(&v, &i).is_ok());
    })
}

fn new_bench_binary_search_worst_case(b: &mut Bencher, cache: Cache) {
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let mut v = vec![0; size];
    let i = 1;
    v[size - 1] = i;
    b.iter(move || {
        black_box(new_binary_search(&v, &i).is_ok());
    })
}

#[bench]
fn new_binary_search_l1(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L1, |i| i * 2);
}

#[bench]
fn new_binary_search_l2(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L2, |i| i * 2);
}

#[bench]
fn new_binary_search_l3(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L3, |i| i * 2);
}

#[bench]
fn new_binary_search_l1_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L1, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l2_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L2, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l3_with_dups(b: &mut Bencher) {
    new_bench_binary_search(b, Cache::L3, |i| i / 16 * 16);
}

#[bench]
fn new_binary_search_l1_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L1);
}

#[bench]
fn new_binary_search_l2_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L2);
}

#[bench]
fn new_binary_search_l3_worst_case(b: &mut Bencher) {
    new_bench_binary_search_worst_case(b, Cache::L3);
}

fn bench_binary_search_117722<F>(b: &mut Bencher, cache: Cache, mapper: F)
where
    F: Fn(usize) -> usize,
{
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let v = (0..size).map(&mapper).collect::<Vec<_>>();
    let mut r = 0usize;
    b.iter(move || {
        // LCG constants from https://en.wikipedia.org/wiki/Numerical_Recipes.
        r = r.wrapping_mul(1664525).wrapping_add(1013904223);
        // Lookup the whole range to get 50% hits and 50% misses.
        let i = mapper(r % size);
        black_box(binary_search_117722(&v, &i).is_ok());
    })
}

fn bench_binary_search_117722_worst_case(b: &mut Bencher, cache: Cache) {
    let size = match cache {
        Cache::L1 => 1000,      // 8kb
        Cache::L2 => 10_000,    // 80kb
        Cache::L3 => 1_000_000, // 8Mb
    };
    let mut v = vec![0; size];
    let i = 1;
    v[size - 1] = i;
    b.iter(move || {
        black_box(binary_search_117722(&v, &i).is_ok());
    })
}

#[bench]
fn binary_search_117722_l1(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L1, |i| i * 2);
}

#[bench]
fn binary_search_117722_l2(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L2, |i| i * 2);
}

#[bench]
fn binary_search_117722_l3(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L3, |i| i * 2);
}

#[bench]
fn binary_search_117722_l1_with_dups(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L1, |i| i / 16 * 16);
}

#[bench]
fn binary_search_117722_l2_with_dups(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L2, |i| i / 16 * 16);
}

#[bench]
fn binary_search_117722_l3_with_dups(b: &mut Bencher) {
    bench_binary_search_117722(b, Cache::L3, |i| i / 16 * 16);
}

#[bench]
fn binary_search_117722_l1_worst_case(b: &mut Bencher) {
    bench_binary_search_117722_worst_case(b, Cache::L1);
}

#[bench]
fn binary_search_117722_l2_worst_case(b: &mut Bencher) {
    bench_binary_search_117722_worst_case(b, Cache::L2);
}

#[bench]
fn binary_search_117722_l3_worst_case(b: &mut Bencher) {
    bench_binary_search_117722_worst_case(b, Cache::L3);
}
Cargo.toml
[package]
name = "binary_search_bench"
version = "0.1.0"
edition = "2021"

[profile.bench]
opt-level = 3
lto = true
codegen-units = 1
panic = "abort"
strip = true
debug = false

@kekeimiku
Author

x86_64 i7-9750H:

(screenshot of benchmark results)

@Amanieu
Member

Amanieu commented Jul 26, 2024

See #128254 for a potential fix.

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 30, 2024
Rewrite binary search implementation

This PR builds on top of rust-lang#128250, which should be merged first.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Fixes rust-lang#115271
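The fixed-trip-count shape the commit message describes can be sketched roughly as follows. This is an editor's illustrative sketch of the rust-lang#45333-style loop, not the exact code merged in #128254; the name `branchless_search` is made up here:

```rust
// Sketch of a binary search whose loop runs the same number of times
// for every input of a given length, so the branch predictor sees no
// data-dependent branches inside the loop. The compare/select step
// typically lowers to a conditional move.
pub fn branchless_search<T: Ord>(s: &[T], x: &T) -> Result<usize, usize> {
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // Select, don't branch: keep `base` or advance it to `mid`.
        base = if s[mid] > *x { base } else { mid };
        size -= half;
    }
    if s[base] == *x {
        Ok(base)
    } else {
        Err(base + (s[base] < *x) as usize)
    }
}

fn main() {
    let v = [1, 3, 5, 7];
    assert_eq!(branchless_search(&v, &5), Ok(2));
    assert_eq!(branchless_search(&v, &4), Err(2));
    assert_eq!(branchless_search(&v, &8), Err(4));
}
```

Because the trip count depends only on `s.len()`, LLVM can fully unroll the loop when the length is known at compile time, which is the property the commit message highlights.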
@kekeimiku
Author

#128254 (MacBook M3)

Just tested it locally; it is clearly better than every `binary_search` variant benchmarked here:

test binary_search_117722_l1            ... bench:          15.42 ns/iter (+/- 0.05)
test binary_search_117722_l1_with_dups  ... bench:           9.31 ns/iter (+/- 0.26)
test binary_search_117722_l1_worst_case ... bench:           6.12 ns/iter (+/- 0.10)
test binary_search_117722_l2            ... bench:          23.26 ns/iter (+/- 0.29)
test binary_search_117722_l2_with_dups  ... bench:          16.02 ns/iter (+/- 0.43)
test binary_search_117722_l2_worst_case ... bench:          11.10 ns/iter (+/- 0.10)
test binary_search_117722_l3            ... bench:          54.34 ns/iter (+/- 0.97)
test binary_search_117722_l3_with_dups  ... bench:          38.40 ns/iter (+/- 0.84)
test binary_search_117722_l3_worst_case ... bench:          20.88 ns/iter (+/- 0.16)
test binary_search_128254_l1            ... bench:           5.22 ns/iter (+/- 0.37)
test binary_search_128254_l1_with_dups  ... bench:           5.30 ns/iter (+/- 0.37)
test binary_search_128254_l1_worst_case ... bench:           4.69 ns/iter (+/- 0.02)
test binary_search_128254_l2            ... bench:           7.96 ns/iter (+/- 0.62)
test binary_search_128254_l2_with_dups  ... bench:           8.26 ns/iter (+/- 0.28)
test binary_search_128254_l2_worst_case ... bench:           7.77 ns/iter (+/- 0.58)
test binary_search_128254_l3            ... bench:          25.39 ns/iter (+/- 0.25)
test binary_search_128254_l3_with_dups  ... bench:          23.24 ns/iter (+/- 0.30)
test binary_search_128254_l3_worst_case ... bench:          12.26 ns/iter (+/- 0.91)
test binary_search_74024_l1             ... bench:          34.97 ns/iter (+/- 1.36)
test binary_search_74024_l1_with_dups   ... bench:          20.12 ns/iter (+/- 1.48)
test binary_search_74024_l1_worst_case  ... bench:           5.95 ns/iter (+/- 0.30)
test binary_search_74024_l2             ... bench:          49.32 ns/iter (+/- 2.69)
test binary_search_74024_l2_with_dups   ... bench:          34.58 ns/iter (+/- 2.95)
test binary_search_74024_l2_worst_case  ... bench:           9.62 ns/iter (+/- 0.44)
test binary_search_74024_l3             ... bench:         104.63 ns/iter (+/- 12.64)
test binary_search_74024_l3_with_dups   ... bench:          79.60 ns/iter (+/- 7.00)
test binary_search_74024_l3_worst_case  ... bench:          15.58 ns/iter (+/- 0.13)
test old_binary_search_l1               ... bench:           5.30 ns/iter (+/- 0.06)
test old_binary_search_l1_with_dups     ... bench:           5.31 ns/iter (+/- 0.02)
test old_binary_search_l1_worst_case    ... bench:           4.69 ns/iter (+/- 0.02)
test old_binary_search_l2               ... bench:           8.51 ns/iter (+/- 0.62)
test old_binary_search_l2_with_dups     ... bench:           8.51 ns/iter (+/- 0.04)
test old_binary_search_l2_worst_case    ... bench:           7.78 ns/iter (+/- 0.04)
test old_binary_search_l3               ... bench:          25.41 ns/iter (+/- 0.74)
test old_binary_search_l3_with_dups     ... bench:          23.24 ns/iter (+/- 0.18)
test old_binary_search_l3_worst_case    ... bench:          12.79 ns/iter (+/- 0.92)

@bors bors closed this as completed in 1932602 Aug 2, 2024