Suboptimal performance of BinaryHeap::append #77433

Closed
hanmertens opened this issue Oct 1, 2020 · 11 comments · Fixed by #77435
Labels
  • A-collections – Area: `std::collections`
  • C-enhancement – Category: An issue proposing an enhancement or a PR with one.
  • I-slow – Issue: Problems and improvements with respect to performance of generated code.
  • T-libs – Relevant to the library team, which will review and decide on the PR/issue.

Comments

hanmertens commented Oct 1, 2020

The current implementation of BinaryHeap::append uses a heuristic based on the worst-case number of comparisons to choose between two strategies (the decision logic is sketched below):

  1. Calling self.extend(other.drain()), equivalent to pushing all elements of the other heap.
  2. Appending the other heap's backing vector to self's and rebuilding the heap from scratch.
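
For context, the decision logic looks roughly like this self-contained sketch (paraphrased from the standard library source; the real code lives inside BinaryHeap and calls its private rebuild):

```rust
use std::mem::size_of;

/// Fast floor(log2(x)), mirroring the private helper used by the
/// standard library implementation (x must be nonzero).
fn log2_fast(x: usize) -> usize {
    8 * size_of::<usize>() - (x.leading_zeros() as usize) - 1
}

/// The heuristic in question, assuming the heaps have already been
/// swapped so that `len1 >= len2`: rebuilding costs about
/// 2 * (len1 + len2) comparisons in the worst case, while extending
/// costs about len2 * log2(len1) comparisons in the worst case.
fn better_to_rebuild(len1: usize, len2: usize) -> bool {
    2 * (len1 + len2) < len2 * log2_fast(len1)
}

fn main() {
    // Two equally sized heaps of 50_000 elements: rebuild is chosen
    // (200_000 < 50_000 * 15).
    assert!(better_to_rebuild(50_000, 50_000));
    // A tiny heap appended to a big one: extend is chosen instead.
    assert!(!better_to_rebuild(100_000, 10));
}
```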

I've done some benchmarking, and it turns out that method 1 (based on extend) is always faster for two heaps built from randomly shuffled data – at least on the computers I've tested it on. I've included images of the benchmarks below: first the current append strategy, then the extend strategy (method 1). In each benchmark, two heaps are merged: one containing the number of elements on the x-axis, the other containing 100,000 minus that number (so at the rightmost data points both heaps are equal in size). In both images the red line (sometimes hidden behind the green one) corresponds to std::collections::BinaryHeap; the other lines are not relevant to this issue (they aren't standard library types).

[Image: benchmark of the current append strategy]
[Image: benchmark of the extend strategy]

From the jump in performance in the first graph, you can clearly see where the switch between the two approaches happens. Graphs are similar for smaller heaps; I didn't test larger ones. It's possible method 2 is faster under other circumstances (maybe if one of the heaps contains mostly very small/large elements?), especially if both heaps are (almost) equal in size. However, I think this benchmark is closer to the "average" real-world use case than such a scenario would be (I'd be happy to be proven wrong about this, though).

Simplification of the benchmark that only runs the parts relevant for this issue
Cargo.toml
```toml
[package]
name = "binary_heap_bench"
version = "0.1.0"
edition = "2018"

[dependencies]
criterion = "0.3"
rand = "0.7"

[[bench]]
name = "benchmark"
harness = false
```
benches/benchmark.rs
```rust
use criterion::*;
use rand::{seq::SliceRandom, thread_rng};
use std::collections::BinaryHeap;
use std::iter::FromIterator;
use std::mem;

/// Data type we want to use
type T = u32;

/// Produce shuffled vector containing values 0..n
fn random_data(n: T) -> Vec<T> {
    let mut data = Vec::from_iter(0..n);
    data.shuffle(&mut thread_rng());
    data
}

fn two_heaps<H: From<Vec<T>>>(mut vec1: Vec<T>, i: usize) -> (H, H) {
    let vec2 = vec1.split_off(i);
    (vec1.into(), vec2.into())
}

fn std_heap_append((mut heap1, mut heap2): (BinaryHeap<T>, BinaryHeap<T>)) -> BinaryHeap<T> {
    heap1.append(&mut heap2);
    heap1
}

fn std_heap_extend((mut heap1, mut heap2): (BinaryHeap<T>, BinaryHeap<T>)) -> BinaryHeap<T> {
    if heap1.len() < heap2.len() {
        mem::swap(&mut heap1, &mut heap2);
    }
    heap1.extend(heap2.drain());
    heap1
}

fn benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("append/extend");
    let base = 100000;
    let step = 2500;
    for i in (step..=base as usize / 2).step_by(step) {
        let size = BatchSize::SmallInput;
        group.bench_function(BenchmarkId::new("append", i), |b| {
            b.iter_batched(|| two_heaps(random_data(base), i), std_heap_append, size)
        });
        group.bench_function(BenchmarkId::new("extend", i), |b| {
            b.iter_batched(|| two_heaps(random_data(base), i), std_heap_extend, size)
        });
    }
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
```
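
With these two files in place, the benchmark runs with `cargo bench`; criterion writes its reports under `target/criterion`.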
jonas-schievink added the A-collections, C-enhancement, I-slow, and T-libs labels on Oct 1, 2020
the8472 commented Oct 2, 2020

The comment suggests that the rebuild strategy is more efficient if at least one of the following is true:

  • comparisons are very expensive
  • the appended heap is much larger than self, but self must still contain a handful of elements so that the logarithm is > 2

You can see that the slope of the append method is lower, so its advantage would show if you extended your benchmark by an order of magnitude. If you used a more expensive comparison function it would probably show earlier, too.

hanmertens commented:

The code for append swaps the heaps if other is larger than self, so the appended heap is always smaller than or equal to self. Indeed, the slope of the rebuild method is smaller, but I can't find conditions where that outweighs the method's higher base cost.

To simulate more expensive comparisons I've made a benchmark where each comparison performs some global atomic operations. Below that I've included a benchmark with ten times the number of items in the heaps in total (an order of magnitude more than that makes the benchmark very slow). The trends remain similar.

[Image: benchmark with an expensive comparison function]
[Image: benchmark with ten times larger heaps]

The original implementation (#32526 and #32987) seems to have chosen this heuristic mostly on the expectation that it would be reasonable. However, I think reasoning from the worst-case scenario (O(n log(m))) is pessimistic for the extend method, as a large number of pushes require only one or two comparisons (instead of log(m)) before ending up in the bottom two layers (which contain 75% of all items in the end). Maybe for even larger heaps the rebuild method actually is advantageous again, but I'm not sure how to determine a heuristic for that.
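
To put a rough number on that (a back-of-the-envelope estimate, assuming a pushed element is equally likely to end up in any slot of the final heap): half the slots are leaves, a quarter sit one level up, and so on, so the expected number of levels a push climbs is

```latex
\mathbb{E}[\text{levels climbed}] = \sum_{k \ge 0} k \cdot 2^{-(k+1)} = 1
```

i.e. a constant number of comparisons per push on average, rather than the worst-case log2(m).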

Changed code for atomic benchmark
```rust
// These definitions replace `type T = u32;` and `random_data` in
// benches/benchmark.rs above; everything else stays the same.
use std::cmp::Ordering;
use std::sync::atomic::{self, AtomicU64};

static C: AtomicU64 = AtomicU64::new(0);

/// Burn some time on every comparison via global atomic traffic.
fn expensive() {
    for _ in 0..10 {
        C.fetch_add(1, atomic::Ordering::Relaxed);
        C.fetch_sub(1, atomic::Ordering::Relaxed);
    }
}

struct T(u64);

impl PartialEq for T {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq(&other.0)
    }
}

impl Eq for T {}

impl PartialOrd for T {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        expensive();
        self.0.partial_cmp(&other.0)
    }
}

impl Ord for T {
    fn cmp(&self, other: &Self) -> Ordering {
        expensive();
        self.0.cmp(&other.0)
    }
}

fn random_data(n: u64) -> Vec<T> {
    let mut data = Vec::from_iter((0..n).map(T));
    data.shuffle(&mut thread_rng());
    data
}
```

the8472 commented Oct 2, 2020

> The code for append swaps the heaps if other is larger than self, so the appended heap is always smaller than or equal to the self heap.

Ah, I missed that. That indeed limits the opportunities to realize its advantage.

> Below that I've included a benchmark with ten times the number of items in the heaps in total (an order of magnitude more than that makes the benchmark very slow). The trends remain similar.

The x-axis still says 50k though?

> Maybe for even larger heaps the rebuild method actually is advantageous again, but I'm not sure how to determine a heuristic for that.

First we would need a benchmark demonstrating that advantage anyway.

hanmertens commented:

> The x-axis still says 50k though?

Only for the first image; that one has the same number of elements as before but a more expensive comparison implementation. The x-axis of the second image goes up to 500k; that one has normal "cheap" integer comparison but a larger number of items in the heaps.

> First we would need a benchmark demonstrating that advantage anyway.

Certainly.

scottmcm commented Oct 3, 2020

Naively, when I look at the graphs in the OP, what I see isn't that extend is always going to be better -- the slopes seem to be clearly different, though two graphs with different scales and extra lines make that comparison harder than it ought to be -- but that the current heuristic is doing a bad job of accounting for constant factors when picking the switchover point.

Why isn't the answer here just to use a number other than two in the heuristic?

```rust
2 * (len1 + len2) < len2 * log2_fast(len1)
```

hanmertens commented:

> though two graphs with different scales and extra lines make that comparison harder than it ought to be

Here are the two lines from a different measurement of the same benchmark, in a single graph. It has a weird jump between 25000 and 27500 (an artifact?), which is why I included the messier pictures initially.

[Image: both strategies in a single graph]

> when I look at the graphs in the OP what I see isn't that extend is always going to be better -- the slopes seem to be clearly different

The slopes are indeed very different, and extrapolating them would lead to a crossover point. However, that crossover point can never actually be reached, because you can't extrapolate beyond the rightmost data point. The benchmark is set up such that two heaps are merged, one containing x elements and the other 100,000 - x. The rightmost data point at x = 50,000 corresponds to merging two equally sized heaps. The data point that would be to the right of that (x = 52,500) would give the same result as the data point to its left (x = 47,500), because both involve merging a heap of 47,500 elements with one of 52,500, and the heaps are swapped in one of those cases to make sure the smaller heap is appended to the larger one. In other words, beyond the rightmost data point the slopes change sign and the data mirrors what lies to the left of that point (which is why I didn't include it). The crossover point is thus only reachable (in this benchmark) if you removed the swap optimization (theoretically, I mean; in practice that would worsen performance, of course).

> Why isn't the answer here just to use a number other than two in the heuristic?

The TL;DR is that it appears the number would have to be such that the condition always evaluates to false.

On another note, I've now also benchmarked merging very large heaps (100 million elements after merge), and extend is always more favorable there as well.

[Image: benchmark with 100 million elements after merging]

SkiFire13 commented:

I just tried the benchmark with the changes from PR #78857, and append scores significantly better. The heuristic could still be improved, though.

[Image: benchmark with PR #78857 applied]

scottmcm commented Nov 8, 2020

Thanks for the update to remind me about this 🙂

One thing that I'd missed last time: extending from the contents of a BinaryHeap can never hit the worst case (needing to bubble to the root every time), since it's being extended by something that's in roughly descending order.
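
A toy way to see this empirically (a sketch of mine, not the standard library's code, using a comparison-counting wrapper type): extend a fresh heap once from ascending input, which forces every push to bubble to the root, and once from another heap's drain order.

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BinaryHeap;

thread_local! {
    // Per-thread comparison counter.
    static COMPARISONS: Cell<u64> = Cell::new(0);
}

#[derive(PartialEq, Eq)]
struct Counted(u32);

impl PartialOrd for Counted {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Counted {
    fn cmp(&self, other: &Self) -> Ordering {
        COMPARISONS.with(|c| c.set(c.get() + 1));
        self.0.cmp(&other.0)
    }
}

/// Run `f` and return how many comparisons it performed.
fn count(f: impl FnOnce()) -> u64 {
    COMPARISONS.with(|c| c.set(0));
    f();
    COMPARISONS.with(|c| c.get())
}

fn main() {
    let n = 100_000u32;

    // Worst case: ascending input makes every push bubble all the way up,
    // costing about log2(i) comparisons for the i-th element.
    let ascending = count(|| {
        let mut heap = BinaryHeap::new();
        heap.extend((0..n).map(Counted));
    });

    // Draining another heap yields its backing array, where parents precede
    // children, so elements arrive in roughly descending order and most
    // pushes stop after a single comparison.
    let mut src: BinaryHeap<Counted> = (0..n).map(Counted).collect();
    let drain_order = count(|| {
        let mut heap = BinaryHeap::new();
        heap.extend(src.drain());
    });

    println!("ascending input:  {} comparisons", ascending);
    println!("heap drain order: {} comparisons", drain_order);
}
```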

Do you want to wait on #78857 before looking at #77435 more?

SkiFire13 commented:

I think for now it would be better to wait and/or work towards a better heuristic (even though append can perform better, it's still up to 25% slower when the second heap is much smaller).

PS/Disclaimer: I'm the author of #78857 and not a reviewer, so I may be a little biased; however, the benchmarks should speak for themselves. It would be nice if someone else could confirm my claims.

hanmertens commented:

Nice performance improvements in #78857, just a pity it makes determining the switchover heuristic more difficult ;).

I ran the benchmark again with #78857, and on my machine it also improves append's performance enough that it's the faster method when merging larger heaps. I guess it doesn't make much sense to merge #77435 as-is because of that. I'll see if I can determine a better heuristic based on the improved append performance.

hanmertens commented:

In order to determine the new switchover point, I ran some benchmarks for heaps of different lengths. For a constant total heap length (i.e. the heap length after the append operation), the execution time has an (approximately) linear relationship with the length of the smaller heap. This means the data can be fitted with a least-squares linear fit:

[Image: fit-2-17 – linear fits for a total heap length of 2^17]

The blue line is the rebuild strategy, and the purple one the extend strategy. The green vertical line indicates the intersection of the fitted lines, i.e. where the switchover between strategies should ideally be. The red vertical line indicates the switchover point under the current heuristic. The fits for all the other total lengths can be found in the collapsible section at the end of this comment. I've plotted the location of the ideal switchover point as a function of the total heap length (the current heuristic is not a smooth curve because it uses a fast but inexact log2):

[Image: fit-fit – ideal switchover point as a function of total heap length]
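
As an aside, the fit-and-crossover computation is simple enough to sketch; below is a minimal, self-contained example (with made-up sample timings, purely illustrative) of the least-squares fit and the intersection of the two fitted lines:

```rust
/// Least-squares fit of y = a + b * x over the sample points; returns (a, b).
fn linear_fit(xs: &[f64], ys: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let sx: f64 = xs.iter().sum();
    let sy: f64 = ys.iter().sum();
    let sxx: f64 = xs.iter().map(|x| x * x).sum();
    let sxy: f64 = xs.iter().zip(ys).map(|(x, y)| x * y).sum();
    let b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    let a = (sy - b * sx) / n;
    (a, b)
}

/// x-coordinate where the two fitted lines a1 + b1*x and a2 + b2*x cross,
/// i.e. the ideal switchover point.
fn switchover((a1, b1): (f64, f64), (a2, b2): (f64, f64)) -> f64 {
    (a2 - a1) / (b1 - b2)
}

fn main() {
    // Hypothetical timings (ns) for both strategies at a few
    // smaller-heap lengths, just to illustrate the procedure.
    let x = [1000.0, 2000.0, 3000.0, 4000.0];
    let rebuild = [9000.0, 9500.0, 10100.0, 10500.0];
    let extend = [4000.0, 6500.0, 9000.0, 11500.0];
    let fit_rebuild = linear_fit(&x, &rebuild);
    let fit_extend = linear_fit(&x, &extend);
    println!(
        "ideal switchover near {:.0} elements",
        switchover(fit_rebuild, fit_extend)
    );
}
```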

I think a couple points are noteworthy:

  • The current heuristic is quite close to the ideal point for smaller heaps.
  • The switchover point stays roughly constant (when considered as a percentage as in the graph above) for heap lengths 1000 to 100000.
  • For larger heaps, the ideal switchover point rises again, which is not reflected in the current heuristic.

I am not sure how best to model a heuristic for larger heaps, especially because the power-of-2 measurements show the switchover point dropping again for the two largest heaps. However, I think it should be fairly straightforward to improve the current heuristic by setting a lower bound on the switchover point:

```rust
if len1 + len2 < 4096 {
    2 * (len1 + len2) < len2 * log2_fast(len1)
} else {
    2 * (len1 + len2) < len2 * 11
}
```

This results in the green curve in the graph above. It is closer to the ideal switchover point in all cases, but for larger heaps it is still suboptimal. (The two branches fit together at the cutoff: just below a total length of 4096 the larger heap holds at least 2048 elements, and log2_fast returns exactly 11 for any length in [2048, 4095], so the heuristic stays continuous there.) I'd love to hear ideas on how to improve the heuristic further for larger heaps.

Images of all fits

In all graphs, blue is the rebuild strategy and purple the extend strategy. The dots indicate raw data points and the line is a least-squares linear fit. The green vertical line indicates the intersection of the fitted lines, i.e. where the switchover between strategies should ideally be. The red vertical line indicates the current switchover point.

[Images: fit-2-6 through fit-2-25 and fit-10-2 through fit-10-7]

JohnTitor added a commit to JohnTitor/rust that referenced this issue Jan 16, 2021
…ottmcm

Always use extend in BinaryHeap::append

This is faster, see rust-lang#77433.

Fixes rust-lang#77433
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 16, 2021
…tmcm

Always use extend in BinaryHeap::append

This is faster, see rust-lang#77433.

Fixes rust-lang#77433
bors closed this as completed in 32a20f4 on Jan 16, 2021
clint-white added a commit to clint-white/binary-heap-plus-rs that referenced this issue Aug 7, 2022
This commit ports the change in the rebuild heuristic used by
`BinaryHeap::append()` that was added in rust-lang/rust#77435: "Change
BinaryHeap::append rebuild heuristic".  See also the discussion in
rust-lang/rust#77433: "Suboptimal performance of BinaryHeap::append" for
more information on how the new heuristic was chosen.

It also ports the new private method `.rebuild_tail()` now used by
`std::collections::BinaryHeap::append()` from rust-lang/rust#78681:
"Improve rebuilding behaviour of BinaryHeap::retain".

Note that Rust 1.60.0 adds the clippy lint `manual_bits` which warns
against code used here.  We suppress the lint instead of following the
upstream patch which now uses `usize::BITS`, since this was stabilized
in Rust 1.53.0 and this crate's MSRV is currently 1.36.0.