Optimize insertion to only use a single lookup #277
Conversation
Nice work! I'd like to see the impact on the full set of rust-perf benchmarks first though.
I did some newer rustc benchmarks and it seems performance is now more neutral. The size reduction is down to 0.28%.
Adding a branch hint to
I think
I'm using When
You might want to measure using cargo-llvm-lines to compare the amount of LLVM IR generated before & after this change. See #205 for a test program that you can use to measure.
I did some measurements of code in debug mode. I also tried making
src/raw/mod.rs
Outdated
```rust
@@ -1206,6 +1222,90 @@ impl<A: Allocator + Clone> RawTableInner<A> {
        }
    }

    /// Finds the position to insert something in a group.
    #[inline]
    unsafe fn find_insert_slot_in_group(
```
I don't think there is any situation when calling this function could be unsafe. The invariants of `RawTable` guarantee that accessing the first `Group::WIDTH` control bytes is always valid.
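For intuition, here is a minimal, self-contained sketch of what such a group-local search does (the names, constants, and signature are illustrative stand-ins, not hashbrown's actual implementation): the scan only ever touches one fixed-width group of control bytes, which is exactly the region the invariant above guarantees to be readable, so nothing about the operation itself needs to be `unsafe`.

```rust
// Illustrative sketch, not hashbrown's code: find the first EMPTY or DELETED
// control byte within a single fixed-width group.
const GROUP_WIDTH: usize = 16;
const EMPTY: u8 = 0b1111_1111;   // sentinel values with the top bit set
const DELETED: u8 = 0b1000_0000;

fn find_insert_slot_in_group(
    group: &[u8; GROUP_WIDTH],
    group_start: usize,
    bucket_mask: usize,
) -> Option<usize> {
    group
        .iter()
        // EMPTY and DELETED are the only control bytes with the top bit set;
        // FULL buckets store a 7-bit hash fragment instead.
        .position(|&ctrl| ctrl & 0x80 != 0)
        // Convert the in-group offset to an absolute bucket index, wrapping to
        // the table size (always a power of two).
        .map(|offset| (group_start + offset) & bucket_mask)
}

fn main() {
    let mut group = [0u8; GROUP_WIDTH]; // every byte 0 = FULL with fragment 0
    group[3] = DELETED;
    group[7] = EMPTY;
    // A group starting at bucket 32 in a 64-bucket table (mask = 63).
    assert_eq!(find_insert_slot_in_group(&group, 32, 63), Some(35));
    println!("first insertable slot: bucket 35");
}
```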
src/raw/mod.rs
Outdated
```rust
    }

    /// Searches for an element in the table,
    /// or a potential slot where that element could be inserted.
```
As with #279, add a comment explaining why we are using dynamic dispatch here.
src/raw/mod.rs
Outdated
```rust
let index = self.find_insert_slot_in_group(&group, &probe_seq);

if likely(index.is_some()) {
    // Only stop the search if the group is empty. The element might be
    // in a following group.
    if likely(group.match_empty().any_bit_set()) {
        // Use a tombstone if we found one
        return if unlikely(tombstone.is_some()) {
            (tombstone.unwrap(), false)
        } else {
            (index.unwrap(), false)
        };
    } else {
        // We found a tombstone, record it so we can return it as a potential
        // insertion location.
        tombstone = index;
    }
}
```
I would reorganize this code like this, which makes it much clearer:
```rust
// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if insert_slot.is_none() {
    insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
}

// Only stop the search if the group contains at least one empty element.
// Otherwise, the element that we are looking for might be in a following group.
if likely(group.match_empty().any_bit_set()) {
    return (insert_slot.unchecked_unwrap(), false);
}
```

(`tombstone` is renamed to `insert_slot`.)
I verified that this is still a performance win for rustc after #279 landed. New benchmark run (size reduction now at 0.19%):
Any suggestions on what to do with the
I made a test crate with 117 `HashMap` instances (13 key types × 9 value types):

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;

fn map<K: Hash + Debug + Eq + Clone, V: Debug>(k: K, v: V) {
    let mut map = HashMap::new();
    map.insert(k.clone(), v);
    map.reserve(1000);
    dbg!(map.get(&k), map.iter().next());
}

fn values<K: Hash + Debug + Eq + Clone>(k: K) {
    map(k.clone(), ());
    map(k.clone(), "");
    map(k.clone(), true);
    map(k.clone(), 1i8);
    map(k.clone(), 1u8);
    map(k.clone(), 1u32);
    map(k.clone(), 1i32);
    map(k.clone(), vec![1u32]);
    map(k.clone(), vec![1i32]);
}

fn main() {
    values(());
    values("");
    values(true);
    values(1i8);
    values(1u8);
    values(1u64);
    values(1i64);
    values(1usize);
    values(1isize);
    values(String::new());
    values(vec![""]);
    values(vec![1u32]);
    values(vec![1i32]);
}
```

Results:
At this point, I'd like to see some real benchmarks from rust-perf. You should open a PR in rust-lang/rust where you replace hashbrown in std with your branch. That way we can use the rust timer bot to get some numbers on compilation time. (Like we did for rust-lang/rust#77566)
Make rehashing and resizing less generic

This makes the code in `rehash_in_place`, `resize` and `reserve_rehash` less generic on `T`. It also improves the performance of rustc. That performance increase is partially attributed to the use of `#[inline(always)]`.

This is the effect on rustc runtime:
```
clap:check               1.9523s  1.9327s  -1.00%
hashmap-instances:check  0.0628s  0.0624s  -0.57%
helloworld:check         0.0438s  0.0436s  -0.50%
hyper:check              0.2987s  0.2970s  -0.59%
regex:check              1.1497s  1.1402s  -0.82%
syn:check                1.7004s  1.6851s  -0.90%
syntex_syntax:check      6.9232s  6.8546s  -0.99%
winapi:check             8.3220s  8.2857s  -0.44%

Total                   20.4528s 20.3014s  -0.74%
Summary                  4.0000s  3.9709s  -0.73%
```

`rustc_driver`'s code size is increased by 0.02%.

This is the effect this has on compile time for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check     0.0636s  0.0632s  -0.61%
hashmap-instances:release  33.0166s 32.2487s  -2.33%
hashmap-instances:debug     7.8677s  7.2012s  -8.47%

Total                      40.9479s 39.5131s  -3.50%
Summary                     1.5000s  1.4430s  -3.80%
```

The `hashmap-instances:debug` compile time could be further improved if there was a way to apply `#[inline(always)]` only on release builds.
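For readers unfamiliar with the technique, here is a hedged sketch of the "less generic on `T`" pattern this commit describes; every name and signature below is a hypothetical stand-in, not hashbrown's internals. The heavy routine is type-erased (it only sees a `Layout` and a `&dyn Fn` callback), so it is instantiated once, while the per-`T` wrapper stays tiny:

```rust
use std::alloc::Layout;

// Non-generic core: only ever instantiated once, no matter how many different
// map types a crate uses. It sees the item layout and a type-erased callback
// instead of `T` itself.
fn reserve_rehash_inner(len: usize, item_layout: Layout, hash_item: &dyn Fn(usize) -> u64) {
    let _ = item_layout; // the core only needs size/alignment, never `T`
    for i in 0..len {
        let _new_bucket = hash_item(i); // real code would move item `i` to its new bucket
    }
}

// Thin generic wrapper: the only part that gets monomorphized per `T`.
fn reserve_rehash<T>(items: &[T], hasher: impl Fn(&T) -> u64) {
    reserve_rehash_inner(items.len(), Layout::new::<T>(), &|i| hasher(&items[i]));
}

fn main() {
    reserve_rehash(&[1u32, 2, 3], |x| u64::from(*x));
    reserve_rehash(&["a", "bb"], |s| s.len() as u64);
    println!("one type-erased core shared by two different `T`s");
}
```

The same idea shows up later in this thread, where `insert` passes a `&dyn Fn(&mut Self)` reserve callback into a non-generic `mark_insert`.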
Inline small functions

This adds `#[inline]` to small functions which should be beneficial to inline.

rustc compilation performance (code size of `rustc_driver` up by 0.09%):
```
clap:check               1.9486s  1.9416s  -0.36%
hashmap-instances:check  0.0629s  0.0626s  -0.52%
helloworld:check         0.0443s  0.0439s  -0.69%
hyper:check              0.3011s  0.3000s  -0.36%
regex:check              1.1505s  1.1468s  -0.33%
syn:check                1.6989s  1.6904s  -0.50%
syntex_syntax:check      6.8479s  6.8288s  -0.28%
winapi:check             8.3437s  8.2967s  -0.56%

Total                   20.3979s 20.3108s  -0.43%
Summary                  4.0000s  3.9820s  -0.45%
```

This is the effect this has on compile time for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check     0.0635s  0.0632s  -0.33%
hashmap-instances:release  32.0928s 32.4440s  +1.09%
hashmap-instances:debug     7.2193s  7.2800s  +0.84%

Total                      39.3756s 39.7873s  +1.05%
Summary                     1.5000s  1.5080s  +0.54%
```

We saw a 1.6% improvement in rustc's build time for a -20% improvement on `hashmap-instances:release` on rust-lang/rust#87233. So I would expect around a 0.08% regression for rustc's build time from this PR.
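As a side note, the feature-gated inlining pattern that appears throughout this thread looks like the sketch below (the helper itself is made up; only the `inline-more` attribute form is taken from the code quoted later in this conversation). It lets users trade run-time speed against the compile time of crates that instantiate many map types:

```rust
// Sketch: `#[inline]` is only applied when the `inline-more` cargo feature is
// enabled, so downstream crates can opt out of the extra codegen cost.
#[cfg_attr(feature = "inline-more", inline)]
pub fn mix(x: u64) -> u64 {
    // Illustrative small helper; any cheap hot-path function fits the pattern.
    x ^ (x >> 33)
}

fn main() {
    println!("{}", mix(0xDEAD_BEEF));
}
```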
I re-ran the benchmarks after #283 landed.
The rustc-perf results are unfortunately rather disappointing and I don't feel comfortable merging this PR as it is. This PR muddies the water a bit since it involves two distinct changes: combining the insert/lookup and reducing LLVM IR generated. It would be best to explore each separately so that we get a better understanding of the perf characteristics:
I assume you're talking about the instruction counts. If so, may I ask why you seem to value those over wall-time benchmarks?
I did verify that the code with dynamic dispatch optimizes to x86-64 assembly similar to the code without it. That does assume that LLVM's inlining plays along, however.
I'm not sure what you mean here. All 3 lookups on
I do think this makes sense.
I did an experiment where I put
These are the results. The numbers are in order and the percentages are relative to the first column.
There doesn't seem to be a significant difference between them (for
Actually I also looked at the CPU cycles on rustc-perf (
I'm referring to
In that case let's stick with the existing algorithm and just focus on the inlining optimizations.
I noticed that there are more crates that improve than regress on wall time, on both perf runs. I wonder if
The reserve call in the middle is a bit annoying. I worked around it with some cheeky unsafe code though. I'll see how it performs.
I meant that this PR has a smaller algorithm, so it inlines better. I haven't done any pure inlining optimizations here.
Wall-time can be very noisy when multiple threads are involved (codegen units).
I tried a less generic version:

```rust
#[cfg_attr(feature = "inline-more", inline)]
pub fn insert(&mut self, hash: u64, value: T, hasher: impl Fn(&T) -> u64) -> Bucket<T> {
    unsafe {
        let index = self.table.mark_insert(hash, &move |table| {
            // Recover the generic outer table (`Self`) from the non-generic
            // inner-table pointer so the generic `reserve` can be called.
            (*(table as *mut RawTableInner<A> as *mut Self)).reserve(1, &hasher)
        });
        let bucket = self.bucket(index);
        bucket.write(value);
        bucket
    }
}

#[inline]
unsafe fn mark_insert(&mut self, hash: u64, reserve_one: &dyn Fn(&mut Self)) -> usize {
    let mut index = self.find_insert_slot(hash);

    // We can avoid growing the table once we have reached our load
    // factor if we are replacing a tombstone. This works since the
    // number of EMPTY slots does not change in this case.
    let old_ctrl = *self.ctrl(index);
    if unlikely(self.growth_left == 0 && special_is_empty(old_ctrl)) {
        reserve_one(self);
        index = self.find_insert_slot(hash);
    }

    self.record_item_insert_at(index, old_ctrl, hash);
    index
}
```

Compile times take a hit:
rustc performance does improve:
Here's the result of the perf run with
I can see the improvement now. However, the inlining is still a problem: we want rustc to have good performance, but at the same time we want to limit inlining so that crates still compile reasonably fast. I'm honestly not sure what the best approach to take here is.
An alternative could be to write
I rebased and reran some rustc benchmarks. The results seem pretty similar, but
☔ The latest upstream changes (presumably #405) made this pull request unmergeable. Please resolve the merge conflicts.
You've run into the same bug as me. In fact, you cannot check for full buckets and for empty and deleted buckets at the same time. You should first check all full buckets up to the moment you find an empty bucket, and only then look for empty and deleted buckets. This is necessary since there may be collisions in the map.
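To make the two-phase idea concrete, here is a small, self-contained sketch (a simplified linear-probing table, not hashbrown's SIMD-group implementation; all names are hypothetical): phase 1 only looks for a matching full bucket and stops at the first empty one, and only then does phase 2 pick an insertion slot, which may be an earlier tombstone.

```rust
// Control bytes: EMPTY, DELETED, or a non-negative value meaning FULL.
const EMPTY: i8 = -1;
const DELETED: i8 = -2;

struct Table {
    ctrl: Vec<i8>,  // one control byte per bucket; len is a power of two
    keys: Vec<u64>, // key stored in each FULL bucket
}

impl Table {
    /// Ok(index) if `key` is already present, Err(slot) with an insertion slot otherwise.
    fn find_or_insert_slot(&self, hash: u64, key: u64) -> Result<usize, usize> {
        let mask = self.ctrl.len() - 1;

        // Phase 1: search for an existing entry among FULL buckets only,
        // stopping at the first EMPTY bucket. A DELETED bucket must not stop
        // the search, because a colliding key may live past the tombstone.
        let mut pos = (hash as usize) & mask;
        loop {
            match self.ctrl[pos] {
                EMPTY => break,
                DELETED => {}
                _ if self.keys[pos] == key => return Ok(pos),
                _ => {}
            }
            pos = (pos + 1) & mask;
        }

        // Phase 2: the key is absent, so now (and only now) look for an EMPTY
        // or DELETED bucket; this may reuse a tombstone earlier in the probe.
        let mut pos = (hash as usize) & mask;
        loop {
            if self.ctrl[pos] == EMPTY || self.ctrl[pos] == DELETED {
                return Err(pos);
            }
            pos = (pos + 1) & mask;
        }
    }
}

fn main() {
    // 8-bucket table: bucket 2 holds a tombstone, bucket 3 holds key 42.
    let mut t = Table { ctrl: vec![EMPTY; 8], keys: vec![0; 8] };
    t.ctrl[2] = DELETED;
    t.ctrl[3] = 0;
    t.keys[3] = 42;

    assert_eq!(t.find_or_insert_slot(2, 42), Ok(3)); // lookup skips the tombstone
    assert_eq!(t.find_or_insert_slot(2, 7), Err(2)); // insertion reuses the tombstone
    println!("two-phase probe behaves as described");
}
```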
Here is my last attempt: master...JustForFun88:hashbrown:one_lookup_other. It gives a performance increase, but a very small one, from 0 to 5 percent.
src/raw/mod.rs
Outdated
```rust
                .match_empty_or_deleted()
                .lowest_set_bit_nonzero();
        }
    }
```
I think this check can be moved out of the loop in `find_potential_inner` so that it is only executed if we find out that the actual element doesn't exist.
I've changed it to do this test on loop exit instead.
```rust
// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if likely(insert_slot.is_none()) {
    insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
```
I think you could add a fast path here to immediately continue the loop if the returned `insert_slot` is `None`.
That would add a conditional to the common case of finding an insertion slot though, and only save `group.match_empty()`, which is cheap already.
Can you expand on this? It's not clear to me why the code is incorrect. It's essentially merging both loops to run them at the same time.
Oh, sorry. I did not study the code enough; I just looked at why

In principle, here is an equivalent version that works correctly without early reservation of additional space: master...JustForFun88:hashbrown:one_lookup_other. It gives a speedup of up to 10%. We can combine both approaches.

Cargo bench
@JustForFun88 Isn't your implementation basically the same as what we already had before: first a lookup to find a matching entry, followed by a search for an empty slot if the lookup failed?
Yes, you are right. It all boils down to this one way or another. I tried to do something similar to this pull request (master...JustForFun88:hashbrown:one_lookup_fourth). I didn't get any performance improvement from it. I think this is due to the fact that an additional branch is added to the loop:

```rust
let mut group_insert = unsafe { Group::load(self.ctrl(probe_seq.pos)) };
let mut group_find = group_insert;

if unlikely(found) {
    // found = true
    return (self.bucket(index), found);
}
```

But in general, the acceleration is not great, which is why I closed my PR.
Have you tried benchmarking your optimization on the rustc perf suite? It's much better to benchmark a real program (rustc itself makes heavy use of hashmaps), and we've seen good improvements there that don't show up in microbenchmarks. To do this you will need to build rust from source with hashbrown in the standard library changed to point to your branch.
☔ The latest upstream changes (presumably #411) made this pull request unmergeable. Please resolve the merge conflicts.
I just benchmarked this branch against master with rustc-perf and found almost no difference in perf.

Summary
Primary benchmarks
Secondary benchmarks
The benchmarks do look much better though:
I re-ran the benchmarks, this time accounting for inlining changes by force-inlining everything, and the results are still good with a ~5% speedup. I think this is good to merge!
Just change that benchmark to only insert 223 times instead of 224 times. The issue is that this PR makes
I updated the benchmark.
@bors r+
Optimize insertion to only use a single lookup

This changes the `insert` method to use a single lookup for insertion instead of 2 separate lookups. This reduces runtime of rustc on check builds by an average of 0.5% on [local benchmarks](https://github.com/Zoxc/rcb/tree/93790089032fc3fb4a4d708fb0adee9551125916/benchs) (1% for `winapi`). The compiler size is reduced by 0.45%.

`unwrap_unchecked` is lifted from `libstd` since it's currently unstable and that probably requires some attribution.
💔 Test failed - checks-actions
@bors retry
☀️ Test successful - checks-actions
This changes the `insert` method to use a single lookup for insertion instead of 2 separate lookups. This reduces runtime of rustc on check builds by an average of 0.5% on local benchmarks (1% for `winapi`). The compiler size is reduced by 0.45%.

`unwrap_unchecked` is lifted from `libstd` since it's currently unstable and that probably requires some attribution.