Optimize insertion to only use a single lookup #277

Merged 1 commit into rust-lang:master on Apr 1, 2023

Conversation

@Zoxc (Contributor) commented Jul 9, 2021

This changes the insert method to use a single lookup for insertion instead of 2 separate lookups.

This reduces runtime of rustc on check builds by an average of 0.5% on local benchmarks (1% for winapi). The compiler size is reduced by 0.45%.

unwrap_unchecked is lifted from libstd since it's currently unstable and that probably requires some attribution.
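For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the single-lookup idea. It is illustrative only: it uses plain linear probing and made-up names (`CtrlByte`, `Table`, `find_or_find_insert_slot`), not hashbrown's SIMD groups, triangular probing, hashing, or growth logic. The point is that one probe walk both searches for an equal key and remembers the first reusable slot it passes, so a miss can insert without a second walk.

```rust
/// Control bytes for the sketch; hashbrown instead packs these states (plus 7
/// hash bits for full slots) into one byte per bucket and scans them with SIMD.
#[derive(Clone, Copy)]
enum CtrlByte {
    Empty,
    Deleted,
    Full,
}

struct Table {
    ctrl: Vec<CtrlByte>,
    keys: Vec<Option<u64>>,
}

impl Table {
    fn new(buckets: usize) -> Self {
        assert!(buckets.is_power_of_two());
        Table {
            ctrl: vec![CtrlByte::Empty; buckets],
            keys: vec![None; buckets],
        }
    }

    /// One probe walk: returns the bucket holding `key` if present, otherwise
    /// the slot where it should be inserted (reusing the first tombstone seen).
    fn find_or_find_insert_slot(&self, key: u64) -> (usize, bool) {
        let mask = self.ctrl.len() - 1;
        let mut pos = (key as usize) & mask;
        let mut insert_slot = None;
        loop {
            match self.ctrl[pos] {
                CtrlByte::Full if self.keys[pos] == Some(key) => return (pos, true),
                CtrlByte::Full => {}
                CtrlByte::Deleted => {
                    // Remember the first tombstone, but keep searching: the key
                    // could still live further along the probe sequence.
                    if insert_slot.is_none() {
                        insert_slot = Some(pos);
                    }
                }
                CtrlByte::Empty => {
                    // An empty slot ends the probe sequence, so the key is
                    // definitely absent; prefer the tombstone if we saw one.
                    return (insert_slot.unwrap_or(pos), false);
                }
            }
            pos = (pos + 1) & mask; // no growth in this sketch: assumes a free slot exists
        }
    }

    fn insert(&mut self, key: u64) {
        let (index, found) = self.find_or_find_insert_slot(key);
        if !found {
            self.ctrl[index] = CtrlByte::Full;
            self.keys[index] = Some(key);
        }
    }
}

fn main() {
    let mut table = Table::new(16);
    table.insert(3);
    table.insert(19); // 19 & 15 == 3, so it collides with 3 and lands in slot 4
    table.insert(3); // finds the existing entry in the same single walk
    assert_eq!(table.find_or_find_insert_slot(19), (4, true));
}
```

The PR's find_potential works in the same spirit, returning an index plus a found flag (see the review snippets below), with group-wide SIMD matching and triangular probing in place of the linear scan shown here.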

@Amanieu (Member) left a comment:

Nice work! I'd like to see the impact on the full set of rust-perf benchmarks first though.

(Several review threads on src/raw/mod.rs, all resolved; the relevant ones are shown below.)
@Zoxc (Contributor, author) commented Jul 13, 2021

I did some newer rustc benchmarks and it seems performance is now more neutral. The size reduction is down to 0.28%.

clap:check                        1.9650s   1.9607s  -0.22%
helloworld:check                  0.0444s   0.0442s  -0.58%
hyper:check                       0.2991s   0.2997s  +0.20%
regex:check                       1.1642s   1.1614s  -0.24%
syn:check                         1.7067s   1.7051s  -0.09%
syntex_syntax:check               6.9250s   6.9132s  -0.17%
winapi:check                      8.3790s   8.3943s  +0.18%

Total                            20.4834s  20.4786s  -0.02%
Summary                           3.5000s   3.4954s  -0.13%

Adding a branch hint to reserve improved things (code size reduction at 0.27%):

clap:check                        1.9603s   1.9487s  -0.59%
helloworld:check                  0.0436s   0.0436s  +0.10%
hyper:check                       0.2984s   0.2983s  -0.02%
regex:check                       1.1535s   1.1471s  -0.56%
syn:check                         1.7065s   1.6952s  -0.66%
syntex_syntax:check               6.8963s   6.8691s  -0.39%
winapi:check                      8.3529s   8.3326s  -0.24%

Total                            20.4115s  20.3347s  -0.38%
Summary                           3.5000s   3.4881s  -0.34%

I think rustc ends up being more sensitive to inlining / code layout than the real changes here. I may try to use the new LLVM pass manager and see if that behaves the same way.

@Amanieu (Member) commented Jul 13, 2021

check builds are a bit misleading because they stop before monomorphization happens. One concern I have with the code as it is currently is that there is now much more code that is generic over T. This will likely cause a compilation time regression in debug/opt builds due to the increased amount of code generated by HashMap.

@Zoxc (Contributor, author) commented Jul 13, 2021

I'm using check builds to measure the run-time of hashbrown and explicitly exclude compile-time. Some compile-time could sneak in as applying hashbrown to stage0 only is no longer trivial (with both Cargo and Rust 2018 not supporting conditional crates).

When insert is fully inlined, this PR reduces the lookups from 3 to 1, which is probably why this sees a code size improvement (with optimizations), so I'm not sure if release build times will increase. Debug builds could still suffer though. Some dynamic dispatch may be helpful there, assuming it can be optimized away in release builds. Do you know of some smaller crates than cargo that heavily rely on HashMap that I could throw into my benchmark suite?

@Amanieu (Member) commented Jul 14, 2021

You might want to measure using cargo-llvm-lines to compare the amount of LLVM IR generated before & after this change. See #205 for a test program that you can use to measure.

@Zoxc (Contributor, author) commented Jul 14, 2021

I did some measurements of code in debug mode. I also tried making find_potential and search generic using dynamic dispatch. LLVM seems to be able to inline and produce similar code in release mode with that change.

Type                  Before         PR                       PR with dynamic dispatch
llvm-lines            15081          15843 (+5.05%)           15100 (+0.13%)
rustc-driver bytes    226,703,360    227,664,384 (+0.42%)     226,968,064 (+0.12%)
cargo bytes           31,037,952     31,209,984 (+0.55%)      31,093,248 (+0.18%)
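For reference, the dispatch pattern being measured has roughly this shape (a sketch: the names `find_inner`/`find` and the exact callback signature are assumptions, not the PR's code). The probe loop is compiled once in a non-generic function and calls back through `&mut dyn FnMut` only for the key comparison, so only the thin wrapper is monomorphized per element type.

```rust
// Non-generic inner search: compiled once, regardless of how many HashMap
// instantiations exist. The stand-in "probe loop" here is plain linear probing.
fn find_inner(hash: u64, eq: &mut dyn FnMut(usize) -> bool, buckets: usize) -> Option<usize> {
    let mask = buckets - 1;
    let mut pos = (hash as usize) & mask;
    for _ in 0..buckets {
        if eq(pos) {
            return Some(pos);
        }
        pos = (pos + 1) & mask;
    }
    None
}

// The thin generic wrapper is the only code duplicated per element type.
fn find<T>(hash: u64, items: &[Option<T>], mut is_match: impl FnMut(&T) -> bool) -> Option<usize> {
    find_inner(
        hash,
        &mut |index| items[index].as_ref().map_or(false, &mut is_match),
        items.len(),
    )
}

fn main() {
    let items: Vec<Option<&str>> = vec![None, Some("a"), Some("b"), None];
    assert_eq!(find(1, &items, |&s| s == "b"), Some(2));
}
```

The table above compares exactly this kind of split: the dyn-dispatch column keeps almost all of the llvm-lines and binary-size growth out of the generic code.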

src/raw/mod.rs Outdated
@@ -1206,6 +1222,90 @@ impl<A: Allocator + Clone> RawTableInner<A> {
}
}

/// Finds the position to insert something in a group.
#[inline]
unsafe fn find_insert_slot_in_group(
@Amanieu (Member):

I don't think there is any situation when calling this function could be unsafe. The invariants of RawTable guarantee that accessing the first Group::WIDTH control bytes is always valid.
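A safe sketch in the spirit of that suggestion, operating on a plain byte slice instead of hashbrown's SIMD `Group` (the function body and example values are illustrative; only the EMPTY/DELETED control-byte encodings are taken from hashbrown). Since only reads are performed, no `unsafe` is needed at this level.

```rust
// SwissTable-style control bytes: full slots have the top bit clear.
const EMPTY: u8 = 0b1111_1111;
const DELETED: u8 = 0b1000_0000;

/// Return the offset within the group of the first slot an insertion could use.
fn find_insert_slot_in_group(group: &[u8]) -> Option<usize> {
    group.iter().position(|&ctrl| ctrl == EMPTY || ctrl == DELETED)
}

fn main() {
    let group = [0x17, 0x32, DELETED, EMPTY];
    assert_eq!(find_insert_slot_in_group(&group), Some(2));
}
```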

src/raw/mod.rs Outdated
}

/// Searches for an element in the table,
/// or a potential slot where that element could be inserted.
@Amanieu (Member):

As with #279, add a comment explaining why we are using dynamic dispatch here.

src/raw/mod.rs Outdated
Comment on lines 1285 to 1697
let index = self.find_insert_slot_in_group(&group, &probe_seq);

if likely(index.is_some()) {
// Only stop the search if the group is empty. The element might be
// in a following group.
if likely(group.match_empty().any_bit_set()) {
// Use a tombstone if we found one
return if unlikely(tombstone.is_some()) {
(tombstone.unwrap(), false)
} else {
(index.unwrap(), false)
};
} else {
// We found a tombstone, record it so we can return it as a potential
// insertion location.
tombstone = index;
}
}
@Amanieu (Member):

I would reorganize this code like this, which makes it much clearer:

// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if insert_slot.is_none() {
    insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
}

// Only stop the search if the group contains at least one empty element.
// Otherwise, the element that we are looking for might be in a following group.
if likely(group.match_empty().any_bit_set()) {
    return (insert_slot.unchecked_unwrap(), false);
}

(tombstone is renamed to insert_slot)

@Zoxc (Contributor, author) commented Jul 15, 2021

I verified that this is still a performance win for rustc after #279 landed. New benchmark run (size reduction now at 0.19%):

clap:check                        1.9295s   1.9188s  -0.55%
helloworld:check                  0.0440s   0.0437s  -0.62%
hyper:check                       0.2950s   0.2928s  -0.76%
regex:check                       1.1416s   1.1341s  -0.66%
syn:check                         1.6833s   1.6720s  -0.67%
syntex_syntax:check               6.7729s   6.7454s  -0.41%
winapi:check                      8.2021s   8.1426s  -0.73%

Total                            20.0683s  19.9494s  -0.59%
Summary                           3.5000s   3.4781s  -0.63%

Any suggestions on what to do with the rehash_in_place benchmark? That seems to benchmark reusing tombstones at exactly max capacity, which will now grow the table with this PR. Should it be changed to use half capacity so it will trigger rehashes?

@Zoxc (Contributor, author) commented Jul 15, 2021

I made a test crate with 117 HashMap instances to see what effect this has on compile times:

use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;

fn map<K: Hash + Debug + Eq + Clone, V: Debug>(k: K, v: V) {
    let mut map = HashMap::new();
    map.insert(k.clone(), v);
    map.reserve(1000);
    dbg!(map.get(&k), map.iter().next());
}

fn values<K: Hash + Debug + Eq + Clone>(k: K) {
    map(k.clone(), ());
    map(k.clone(), "");
    map(k.clone(), true);
    map(k.clone(), 1i8);
    map(k.clone(), 1u8);
    map(k.clone(), 1u32);
    map(k.clone(), 1i32);
    map(k.clone(), vec![1u32]);
    map(k.clone(), vec![1i32]);
}

fn main() {
    values(());
    values("");
    values(true);
    values(1i8);
    values(1u8);
    values(1u64);
    values(1i64);
    values(1usize);
    values(1isize);
    values(String::new());
    values(vec![""]);
    values(vec![1u32]);
    values(vec![1i32]);
}

Results:

hashmap-instances:check           0.0628s   0.0625s  -0.60%
hashmap-instances:debug           7.8956s   8.0046s  +1.38%
hashmap-instances:release        32.9421s  32.1129s  -2.52%

@Amanieu (Member) commented Jul 16, 2021

At this point, I'd like to see some real benchmarks from rust-perf. You should open a PR in rust-lang/rust where you replace hashbrown in std with your branch. That way we can use the rust timer bot to get some numbers on compilation time. (Like we did for rust-lang/rust#77566)

bors added a commit that referenced this pull request Jul 21, 2021
Make rehashing and resizing less generic

This makes the code in `rehash_in_place`, `resize` and `reserve_rehash` less generic on `T`. It also improves the performance of rustc. That performance increase is partially attributed to the use of `#[inline(always)]`.

This is the effect on rustc runtime:
```
clap:check                        1.9523s   1.9327s  -1.00%
hashmap-instances:check           0.0628s   0.0624s  -0.57%
helloworld:check                  0.0438s   0.0436s  -0.50%
hyper:check                       0.2987s   0.2970s  -0.59%
regex:check                       1.1497s   1.1402s  -0.82%
syn:check                         1.7004s   1.6851s  -0.90%
syntex_syntax:check               6.9232s   6.8546s  -0.99%
winapi:check                      8.3220s   8.2857s  -0.44%

Total                            20.4528s  20.3014s  -0.74%
Summary                           4.0000s   3.9709s  -0.73%
```
`rustc_driver`'s code size is increased by 0.02%.

This is the effect this has on compile times for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check           0.0636s   0.0632s  -0.61%
hashmap-instances:release        33.0166s  32.2487s  -2.33%
hashmap-instances:debug           7.8677s   7.2012s  -8.47%

Total                            40.9479s  39.5131s  -3.50%
Summary                           1.5000s   1.4430s  -3.80%
```
The `hashmap-instances:debug` compile time could be further improved if there was a way to apply `#[inline(always)]` only on release builds.
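One rough way to approximate "`#[inline(always)]` only in release builds", shown as a sketch rather than anything hashbrown actually does: treat `debug_assertions` (enabled by default in dev profiles, disabled in release) as a proxy for an optimized build. The `h2` helper below is purely illustrative.

```rust
// Sketch: the proxy breaks down if a user enables debug assertions in a
// release profile, so this is an approximation, not a true profile switch.
#[cfg_attr(not(debug_assertions), inline(always))]
#[cfg_attr(debug_assertions, inline)]
pub fn h2(hash: u64) -> u8 {
    // Illustrative helper: keep the top 7 bits of the hash, as SwissTable-style
    // tables do for their control bytes.
    (hash >> 57) as u8 & 0x7f
}

fn main() {
    assert_eq!(h2(u64::MAX), 0x7f);
}
```

hashbrown's `inline-more` cargo feature (visible in snippets later in this thread) is a related knob, but it is opt-in rather than profile-dependent.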
@Zoxc Zoxc mentioned this pull request Jul 21, 2021
bors added a commit that referenced this pull request Jul 21, 2021
Inline small functions

This adds `#[inline]` to small functions which should be beneficial to inline.

rustc compilation performance (code size of `rustc_driver` up by 0.09%):
```
clap:check                        1.9486s   1.9416s  -0.36%
hashmap-instances:check           0.0629s   0.0626s  -0.52%
helloworld:check                  0.0443s   0.0439s  -0.69%
hyper:check                       0.3011s   0.3000s  -0.36%
regex:check                       1.1505s   1.1468s  -0.33%
syn:check                         1.6989s   1.6904s  -0.50%
syntex_syntax:check               6.8479s   6.8288s  -0.28%
winapi:check                      8.3437s   8.2967s  -0.56%

Total                            20.3979s  20.3108s  -0.43%
Summary                           4.0000s   3.9820s  -0.45%
```

This is the effect this has on compile times for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check           0.0635s   0.0632s  -0.33%
hashmap-instances:release        32.0928s  32.4440s  +1.09%
hashmap-instances:debug           7.2193s   7.2800s  +0.84%

Total                            39.3756s  39.7873s  +1.05%
Summary                           1.5000s   1.5080s  +0.54%
```

We saw a 1.6% improvement in rustc's build time for a -20% improvement on `hashmap-instances:release` on rust-lang/rust#87233. So I would expect around a 0.08% regression for rustc's build time from this PR.
@Zoxc (Contributor, author) commented Jul 22, 2021

I re-ran the benchmarks after #283 landed.

clap:check                        1.9603s   1.9501s  -0.52%
hashmap-instances:check           0.0632s   0.0626s  -0.95%
helloworld:check                  0.0437s   0.0435s  -0.46%
hyper:check                       0.2971s   0.2949s  -0.72%
regex:check                       1.1596s   1.1548s  -0.42%
syn:check                         1.7156s   1.7048s  -0.63%
syntex_syntax:check               6.8806s   6.8575s  -0.34%
winapi:check                      8.3276s   8.2277s  -1.20%

Total                            20.4478s  20.2959s  -0.74%
Summary                           4.0000s   3.9738s  -0.66%

hashmap-instances:check           0.0624s   0.0622s  -0.23%
hashmap-instances:release        32.3625s  31.2730s  -3.37%
hashmap-instances:debug           7.2383s   7.3376s  +1.37%

Total                            39.6632s  38.6728s  -2.50%
Summary                           1.5000s   1.4889s  -0.74%

@Zoxc (Contributor, author) commented Jul 23, 2021

New perf run results

@Amanieu (Member) commented Jul 24, 2021

The rustc-perf results are unfortunately rather disappointing and I don't feel comfortable merging this PR as it is. This PR muddies the water a bit since it involves two distinct changes: combining the insert/lookup and reducing LLVM IR generated.

It would be best to explore each separately so that we get a better understanding of the perf characteristics:

  • The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.
  • The new insertion algorithm could be extended to also apply to Entry. This should be possible since VacantEntry can hold the index into which to insert an element.

@Zoxc (Contributor, author) commented Jul 25, 2021

> The rustc-perf results are unfortunately rather disappointing

I assume you're talking about the instruction counts; if so, may I ask why you seem to value those over wall time benchmarks?

> It would be best to explore each separately so that we get a better understanding of the perf characteristics

I did verify that the code with dynamic dispatch optimizes to x86-64 assembly similar to the code without it. That does assume that LLVM's inlining plays along, however.

> The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.

I'm not sure what you mean here. Since #279 landed, all 3 lookups in insert on master have been in the less generic part. There doesn't seem to be a lot more to move out of it.

> The new insertion algorithm could be extended to also apply to Entry. This should be possible since VacantEntry can hold the index into which to insert an element.

I do think this makes sense. insert is hotter for rustc, so I wouldn't expect this to alter benchmark results much.

@Zoxc (Contributor, author) commented Jul 25, 2021

I did an experiment where I put #[inline(never)] on HashMap::insert and built libstd with the inline-more feature. This is to make LLVM's inlining more consistent. I tried 3 variations on HashMap::insert:

  1. find; find_insert_slot; reserve; find_insert_slot (what's on master now)

  2. reserve; find_potential (this PR)

  3. find; reserve; find_insert_slot

These are the results. The numbers are in the same order and the percentages are relative to the first column.

clap:check                        1.9480s   1.9552s  +0.37%   1.9466s  -0.07%
hashmap-instances:check           0.0635s   0.0634s  -0.16%   0.0633s  -0.29%
helloworld:check                  0.0443s   0.0442s  -0.28%   0.0441s  -0.48%
hyper:check                       0.2958s   0.2961s  +0.11%   0.2955s  -0.10%
regex:check                       1.1421s   1.1431s  +0.09%   1.1423s  +0.02%
syn:check                         1.6760s   1.6764s  +0.02%   1.6727s  -0.20%
syntex_syntax:check               6.7950s   6.8020s  +0.10%   6.7909s  -0.06%
winapi:check                      8.2145s   8.1965s  -0.22%   8.1979s  -0.20%

Total                            20.1793s  20.1769s  -0.01%  20.1534s  -0.13%
Summary                           2.6667s   2.6668s  +0.00%   2.6621s  -0.17%

There doesn't seem to be a significant difference between them (for rustc at least). This data seems to support my hypothesis that the benefit of this PR is due to inlining, where the smaller size of insert is beneficial.

@Amanieu (Member) commented Jul 25, 2021

> I assume you're talking about the instruction counts; if so, may I ask why you seem to value those over wall time benchmarks?

Actually I also looked at the CPU cycles on rustc-perf (cycles:u) and the results still don't look very good.

> > The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.
>
> I'm not sure what you mean here. Since #279 landed, all 3 lookups in insert on master have been in the less generic part. There doesn't seem to be a lot more to move out of it.

I'm referring to RawTable::insert. Most of it doesn't depend on T and could be moved to RawTableInner like you did with find.

> There doesn't seem to be a significant difference between them (for rustc at least). This data seems to support my hypothesis that the benefit of this PR is due to inlining, where the smaller size of insert is beneficial.

In that case let's stick with the existing algorithm and just focus on the inlining optimizations.

@Zoxc (Contributor, author) commented Jul 25, 2021

> Actually I also looked at the CPU cycles on rustc-perf (cycles:u) and the results still don't look very good.

I noticed that more crates improve than regress on wall time, on both perf runs. I wonder if cycles:u actually diverges from wall time on perf here. I do wish there was a sum of all crates on perf, which would be more precise. It's also possible that the difference is due to Zen (my CPU) vs Zen 2 (the perf server).

> I'm referring to RawTable::insert. Most of it doesn't depend on T and could be moved to RawTableInner like you did with find.

The reserve call in the middle is a bit annoying. I worked around it with some cheeky unsafe code though. I'll see how it performs.

> In that case let's stick with the existing algorithm and just focus on the inlining optimizations.

I meant that this PR has a smaller algorithm, so it inlines better. I haven't done any pure inlining optimizations here.

@Amanieu (Member) commented Jul 25, 2021

> I noticed that more crates improve than regress on wall time, on both perf runs. I wonder if cycles:u actually diverges from wall time on perf here. I do wish there was a sum of all crates on perf, which would be more precise. It's also possible that the difference is due to Zen (my CPU) vs Zen 2 (the perf server).

Wall-time can be very noisy when multiple threads are involved (codegen units). task-clock is more accurate since it only measures time when a thread is performing actual work rather than being blocked waiting for another thread. cycles is even more accurate since it is not affected by changes in CPU frequency: an add instruction takes 1 cycle no matter what frequency the CPU is running at.

@Zoxc (Contributor, author) commented Jul 25, 2021

I tried a less generic insert variant of the current algorithm:

#[cfg_attr(feature = "inline-more", inline)]
pub fn insert(&mut self, hash: u64, value: T, hasher: impl Fn(&T) -> u64) -> Bucket<T> {
    unsafe {
        let index = self.table.mark_insert(hash, &move |table| {
            (*(table as *mut RawTableInner<A> as *mut Self)).reserve(1, &hasher)
        });
        let bucket = self.bucket(index);
        bucket.write(value);
        bucket
    }
}

#[inline]
unsafe fn mark_insert(&mut self, hash: u64, reserve_one: &dyn Fn(&mut Self)) -> usize {
    let mut index = self.find_insert_slot(hash);

    // We can avoid growing the table once we have reached our load
    // factor if we are replacing a tombstone. This works since the
    // number of EMPTY slots does not change in this case.
    let old_ctrl = *self.ctrl(index);
    if unlikely(self.growth_left == 0 && special_is_empty(old_ctrl)) {
        reserve_one(self);
        index = self.find_insert_slot(hash);
    }

    self.record_item_insert_at(index, old_ctrl, hash);

    index
}

Compile times take a hit:

hashmap-instances:check           0.0636s   0.0637s  +0.10%
hashmap-instances:release        32.5219s  39.9375s +22.80%
hashmap-instances:debug           7.2875s   8.8993s +22.12%

rustc performance does improve:

clap:check                        1.9578s   1.9493s  -0.44%
hashmap-instances:check           0.0631s   0.0631s  -0.06%
helloworld:check                  0.0441s   0.0440s  -0.31%
hyper:check                       0.2972s   0.2965s  -0.26%
regex:check                       1.1550s   1.1511s  -0.34%
syn:check                         1.6938s   1.6893s  -0.26%
syntex_syntax:check               6.9113s   6.9025s  -0.13%
winapi:check                      8.3605s   8.2682s  -1.10%

Total                            20.4829s  20.3640s  -0.58%
Summary                           4.0000s   3.9855s  -0.36%

@Zoxc (Contributor, author) commented Jul 26, 2021

Here's the result of the perf run with #[inline(never)] applied to HashMap::insert again. This shows the instruction count improvement I was expecting, due to the reduced number of lookups. So the regression on instruction count seems to be caused by different inlining behavior.

@Amanieu (Member) commented Jul 30, 2021

I can see the improvement now. However the inlining is still a problem: we want rustc to have good performance but at the same time we want to limit inlining so that crates still compile reasonably fast.

I'm honestly not sure what the best approach to take here is.

@Zoxc (Contributor, author) commented Aug 7, 2021

The find; reserve; find_insert_slot variant could be worth considering. That would probably get rid of the 1% regression on hashmap-instances:debug, but the rustc performance improvement is reduced.

@Marwes (Contributor) commented Aug 30, 2021

An alternative could be to write find_potential in terms of an "iterator" (like ProbeSeq): https://github.com/marwes/hashbrown/tree/opt-insert. This reduces the LLVM line count by ~1% compared to the PR currently, while also removing the dynamic dispatch.
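For context, the probe-sequence "iterator" shape looks roughly like this (a sketch with an assumed group width and simplified names; not the actual branch linked above). Triangular probing in group-sized steps is the scheme hashbrown-style tables use so that every group is visited when the number of buckets is a power of two.

```rust
const WIDTH: usize = 16; // stand-in for Group::WIDTH; real value depends on the SIMD backend

struct ProbeSeq {
    pos: usize,
    stride: usize,
}

impl ProbeSeq {
    fn new(hash: u64, bucket_mask: usize) -> Self {
        ProbeSeq { pos: (hash as usize) & bucket_mask, stride: 0 }
    }

    fn move_next(&mut self, bucket_mask: usize) {
        // Quadratic (triangular) probing in units of one group.
        self.stride += WIDTH;
        self.pos += self.stride;
        self.pos &= bucket_mask;
    }
}

fn main() {
    // Walk a 64-bucket table (mask = 63) and print the probe positions visited.
    let mut seq = ProbeSeq::new(0x1234, 63);
    let mut visited = Vec::new();
    for _ in 0..4 {
        visited.push(seq.pos);
        seq.move_next(63);
    }
    println!("{:?}", visited);
}
```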

@Zoxc (Contributor, author) commented Feb 5, 2023

I rebased and reran some rustc benchmarks. The results seem pretty similar, but hashmap-instances:release seems to have improved.

Benchmark                       Before      After
🟣 winapi:check                 7.5640s     7.5308s    -0.44%
🟣 clap:check                   1.8297s     1.8249s    -0.26%
🟣 hyper:check                  0.2645s     0.2645s    +0.03%
🟣 syntex_syntax:check          6.3546s     6.3379s    -0.26%
🟣 syn:check                    1.6337s     1.6208s    -0.79%
🟣 regex:check                  1.0244s     1.0182s    -0.61%
Total                           18.6709s    18.5972s   -0.39%
Summary                         1.0000s     0.9961s    -0.39%

Benchmark                       Before      After
🟣 hashmap-instances:check      0.0542s     0.0539s    -0.56%
🔵 hashmap-instances:release    21.4382s    19.4949s   💚 -9.06%
🟠 hashmap-instances:debug      5.6909s     5.8824s    💔 +3.37%
Total                           27.1833s    25.4313s   💚 -6.45%
Summary                         1.0000s     0.9791s    💚 -2.09%

@bors commented Feb 23, 2023

☔ The latest upstream changes (presumably #405) made this pull request unmergeable. Please resolve the merge conflicts.

@JustForFun88 (Contributor):

You've run into the same bug as me. In fact, you cannot check for full buckets and empty and deleted buckets at the same time. You should first check all full buckets up to the moment you find an empty bucket, and only then look for empty and deleted buckets. This is necessary, since there may be collisions in the map.

@JustForFun88 (Contributor):

Here is my last attempt: master...JustForFun88:hashbrown:one_lookup_other. It gives a performance increase, but a very small one, from 0 to 5 percent.

src/raw/mod.rs Outdated
.match_empty_or_deleted()
.lowest_set_bit_nonzero();
}
}
@Amanieu (Member):

I think this check can be moved out of the loop in find_potential_inner so that it is only executed if we find out that the actual element doesn't exist.

@Zoxc (Contributor, author):

I've changed it to do this test on loop exit instead.

// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if likely(insert_slot.is_none()) {
insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
@Amanieu (Member):

I think you could add a fast path here to immediately continue the loop if the returned insert_slot is None.

@Zoxc (Contributor, author):

That would add a conditional to the common case of finding an insertion slot though, and only save group.match_empty() which is cheap already.

@Amanieu (Member) commented Feb 23, 2023

> You've run into the same bug as me. In fact, you cannot check for full buckets and empty and deleted buckets at the same time. You should first check all full buckets up to the moment you find an empty bucket, and only then look for empty and deleted buckets. This is necessary, since there may be collisions in the map.

Can you expand on this? It's not clear to me why the code is incorrect. It's essentially merging both loops to run them at the same time.

@JustForFun88 (Contributor):

> Can you expand on this? It's not clear to me why the code is incorrect. It's essentially merging both loops to run them at the same time.

Oh, sorry. I did not study the code enough; I just looked at why rehash_in_place fails and thought it was the same issue as in my first version of the code. It seems that this version has problems specifically with rehashing due to the early reservation of additional space.

In principle, master...JustForFun88:hashbrown:one_lookup_other is an equivalent version that works correctly without early reservation of additional space. It gives an acceleration of up to 10%.

We can combine both approaches.

Cargo bench

running 41 tests                                 Old insert                       New insert

test grow_insert_ahash_highbits  ... bench:      40,496 ns/iter (+/- 468)         38,005 ns/iter (+/- 7,215) 
test grow_insert_ahash_random    ... bench:      37,206 ns/iter (+/- 3,576)       36,579 ns/iter (+/- 5,114) 
test grow_insert_ahash_serial    ... bench:      37,302 ns/iter (+/- 1,749)       37,444 ns/iter (+/- 4,713) 
test grow_insert_std_highbits    ... bench:      59,918 ns/iter (+/- 8,144)       53,695 ns/iter (+/- 7,514) 
test grow_insert_std_random      ... bench:      53,260 ns/iter (+/- 6,954)       52,570 ns/iter (+/- 330)   
test grow_insert_std_serial      ... bench:      59,310 ns/iter (+/- 7,552)       53,242 ns/iter (+/- 232)   
test insert_ahash_highbits       ... bench:      33,941 ns/iter (+/- 402)         25,310 ns/iter (+/- 435)   
test insert_ahash_random         ... bench:      33,520 ns/iter (+/- 197)         27,669 ns/iter (+/- 6,837) 
test insert_ahash_serial         ... bench:      33,553 ns/iter (+/- 213)         25,194 ns/iter (+/- 162)   
test insert_erase_ahash_highbits ... bench:      31,639 ns/iter (+/- 8,427)       28,809 ns/iter (+/- 726)   
test insert_erase_ahash_random   ... bench:      30,538 ns/iter (+/- 3,425)       27,991 ns/iter (+/- 1,656) 
test insert_erase_ahash_serial   ... bench:      29,167 ns/iter (+/- 8,472)       27,278 ns/iter (+/- 6,051) 
test insert_erase_std_highbits   ... bench:      55,232 ns/iter (+/- 2,127)       54,485 ns/iter (+/- 20,457)
test insert_erase_std_random     ... bench:      55,272 ns/iter (+/- 1,277)       55,869 ns/iter (+/- 1,719) 
test insert_erase_std_serial     ... bench:      54,436 ns/iter (+/- 409)         54,708 ns/iter (+/- 533)   
test insert_std_highbits         ... bench:      49,024 ns/iter (+/- 617)         46,770 ns/iter (+/- 1,417) 
test insert_std_random           ... bench:      48,701 ns/iter (+/- 8,394)       44,607 ns/iter (+/- 443)   
test insert_std_serial           ... bench:      48,655 ns/iter (+/- 8,204)       44,600 ns/iter (+/- 365)   
test rehash_in_place             ... bench:     278,943 ns/iter (+/- 2,397)      277,346 ns/iter (+/- 1,913) 

running 2 tests

test insert                  ... bench:      10,316 ns/iter (+/- 151)             9,713 ns/iter (+/- 127)
test insert_unique_unchecked ... bench:       7,902 ns/iter (+/- 86)              7,896 ns/iter (+/- 106)

@Amanieu (Member) commented Feb 23, 2023

@JustForFun88 Isn't your implementation basically the same as what we already had before: first a lookup to find a matching entry, followed by a search for an empty slot if the lookup failed.

@JustForFun88 (Contributor):

> @JustForFun88 Isn't your implementation basically the same as what we already had before: first a lookup to find a matching entry, followed by a search for an empty slot if the lookup failed.

Yes, you are right. It all boils down to this one way or another. I tried to do something similar to this pull request (master...JustForFun88:hashbrown:one_lookup_fourth) and didn't get any performance improvement from it. I think this is due to the fact that an additional branch is added to the loop (if likely(insert_slot.is_none()) in the case of this pull request, or if likely(!found_empty_or_deleted_slot) in the case of my fourth attempt).

I attribute the few percent improvement specifically in master...JustForFun88:hashbrown:one_lookup_other rather to:

  1. A single read (Group::load) from the heap at the initial moment, which in most cases ends in success: that is, either we find an existing element immediately in the first group, or we find an empty or deleted element in the same group (master...JustForFun88:hashbrown:one_lookup_other#diff-655778b213c501f917b62cd79d54725b7fcdc321c37b2f471b5c77ae2d3d818eR1199-R1200):

        let mut group_insert = unsafe { Group::load(self.ctrl(probe_seq.pos)) };
        let mut group_find = group_insert;

  2. The fact that we are skeptical about the existence of the element in the table (master...JustForFun88:hashbrown:one_lookup_other#diff-655778b213c501f917b62cd79d54725b7fcdc321c37b2f471b5c77ae2d3d818eR850-R853):

        if unlikely(found) {
            // found = true
            return (self.bucket(index), found);
        }

But in general, the acceleration is not great, which is why I closed my PR.

@Amanieu (Member) commented Feb 26, 2023

Have you tried benchmarking your optimization on the rustc perf suite? It's much better to benchmark a real program (rustc itself makes heavy use of hashmaps), and we've seen good improvements there that don't show up in microbenchmarks.

To do this you will need to build rust from source with hashbrown in the standard library changed to point to your branch.

@bors commented Mar 29, 2023

☔ The latest upstream changes (presumably #411) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu (Member) commented Mar 31, 2023

I just benchmarked this branch against master with rustc-perf and found almost no difference in perf.

Summary

                 Range             Mean      Count
Regressions      0.31%, 0.73%      0.42%     8
Improvements     -0.32%, -0.32%    -0.32%    1
All              -0.32%, 0.73%     0.34%     9

Primary benchmarks

Benchmark         Profile    Scenario                       % Change    Significance Factor
ripgrep-13.0.0    check      incr-unchanged                 -0.32%      1.58x

Secondary benchmarks

Benchmark         Profile    Scenario                       % Change    Significance Factor
issue-58319       check      incr-full                      0.73%       3.65x
deep-vector       check      incr-patched: println          0.41%       2.03x
deep-vector       check      incr-full                      0.41%       2.03x
deep-vector       check      incr-patched: add vec item     0.41%       2.03x
ucd               check      incr-full                      0.39%       1.93x
ctfe-stress-5     check      incr-full                      0.39%       1.93x
tuple-stress      check      incr-patched: new row          0.34%       1.69x
tuple-stress      check      incr-full                      0.31%       1.54x

@Amanieu (Member) commented Mar 31, 2023

The benchmarks do look much better though:

 name                         orig ns/iter  new ns/iter  diff ns/iter   diff %  speedup 
 insert_ahash_highbits        26,869        18,092             -8,777  -32.67%   x 1.49 
 insert_ahash_random          26,973        18,349             -8,624  -31.97%   x 1.47 
 insert_ahash_serial          27,071        18,290             -8,781  -32.44%   x 1.48 
 insert_std_highbits          33,940        30,837             -3,103   -9.14%   x 1.10 
 insert_std_random            34,001        30,517             -3,484  -10.25%   x 1.11 
 insert_std_serial            33,894        30,478             -3,416  -10.08%   x 1.11 

@Amanieu (Member) commented Mar 31, 2023

I re-ran the benchmarks, this time accounting for inlining changes by force-inlining everything, and the results are still good with a ~5% speedup. I think this is good to merge!

> Any suggestions on what to do with the rehash_in_place benchmark? That seems to benchmark reusing tombstones at exactly max capacity, which will now grow the table with this PR. Should it be changed to use half capacity so it will trigger rehashes?

Just change that benchmark to only insert 223 times instead of 224 times. The issue is that this PR makes insert always grow instead of first checking whether it can reuse a tombstone, but this is fine in practice.
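A sketch of the kind of workload that adjustment describes (illustrative only; the real benchmark lives in hashbrown's benches and differs in detail): keep the live element count at 223, one below the 224-element capacity of a 256-bucket table, while churning removes and inserts so that deleted slots still get recycled.

```rust
use std::collections::HashSet;

// Illustrative only: a 256-bucket SwissTable has a 7/8 load factor, so
// `with_capacity(224)` yields a capacity of exactly 224. Staying at 223 live
// elements is the "one fewer insert" adjustment suggested above; the churn
// below keeps replacing elements so tombstones continue to accumulate.
fn churn() -> HashSet<u32> {
    let mut set: HashSet<u32> = HashSet::with_capacity(224);
    for i in 0..223 {
        set.insert(i);
    }
    for i in 223..1400 {
        set.remove(&(i - 223)); // always removes a key inserted 223 steps earlier
        set.insert(i);
    }
    set
}

fn main() {
    assert_eq!(churn().len(), 223);
}
```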

@Zoxc (Contributor, author) commented Apr 1, 2023

I updated the benchmark.

@Amanieu (Member) commented Apr 1, 2023

@bors r+

@bors commented Apr 1, 2023

📌 Commit b9d97a8 has been approved by Amanieu

It is now in the queue for this repository.

bors added a commit that referenced this pull request Apr 1, 2023
Optimize insertion to only use a single lookup

This changes the `insert` method to use a single lookup for insertion instead of 2 separate lookups.

This reduces runtime of rustc on check builds by an average of 0.5% on [local benchmarks](https://github.com/Zoxc/rcb/tree/93790089032fc3fb4a4d708fb0adee9551125916/benchs) (1% for `winapi`). The compiler size is reduced by 0.45%.

`unwrap_unchecked` is lifted from `libstd` since it's currently unstable and that probably requires some attribution.
@bors commented Apr 1, 2023

⌛ Testing commit b9d97a8 with merge 19db643...

@bors commented Apr 1, 2023

💔 Test failed - checks-actions

@Amanieu (Member) commented Apr 1, 2023

@bors retry

@bors commented Apr 1, 2023

⌛ Testing commit b9d97a8 with merge 329f86a...

@bors commented Apr 1, 2023

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 329f86a to master...


@bors bors merged commit 329f86a into rust-lang:master Apr 1, 2023