Tracking issue for custom hashers in HashMap #27713
Comments
I don't know much about hashing functions, but I just want to say that this one is very important for performance. The SipHasher is very slow, and always eats up between 5% and 20% of the CPU time of each of my projects. By comparison the FnvHasher is around 6 times faster on my machine according to some quick benchmarking. The purpose of the SipHasher is to protect against DDoSes, but the vast majority of the applications written in Rust never even open a socket. |
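For readers who want to see what swapping the hasher looks like, here is a minimal sketch. It assumes the `BuildHasherDefault` adapter named in the stabilization summary later in this thread. `Fnv1a`, `FnvMap` and `fnv1a` are names made up for this illustration and are not the `FnvHasher` mentioned above. It makes no speed claims.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// 64-bit FNV-1a, written out for illustration only (hypothetical type).
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A HashMap that builds a fresh `Fnv1a` for every key it hashes.
pub type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Hash a byte slice directly, bypassing the `Hash` trait.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = Fnv1a::default();
    h.write(bytes);
    h.finish()
}
```

Note that a hasher like this has no random seed, so it gives up the DoS protection that SipHash is there to provide.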
SipHash looks good on long enough data, but it is slow on small values like integers. Its fixed four round finalization at the end is an example of overhead that is comparatively larger when hashing small values. |
See: https://github.com/shepmaster/twox-hash/blob/master/README.md for scaling comparison. Based on these results (which I haven't vetted for quality), one can conclude that SipHash is actually a pretty good "general purpose" hasher. Fnv doesn't scale as well to longer strings (> 32 bytes), and XXHash doesn't scale well to smaller strings (< 32 bytes). |
(XXhash was constructed to be a fast checksum for gigabytes of lz4 data, so that's what it is good at) |
If |
Just noting some discussion that was tangentially related to this: The current hasher infra heavily penalizes Farmhash (and cityhash, and murmurhash), which wants to branch on the size of the input to choose the "optimal" algorithm. Community's current solution is to just buffer in a Vec until However more generally it would be nice (even for SipHash) to be able to indicate to the hasher "there is only one thing to hash, here it is, now give me the final hash". Sketch of design: add:
pub trait Hasher {
...
/// Hash only the given bytes, and immediately finalize.
/// Results are unspecified if other bytes were previously written to this hasher
fn write_only(&mut self, bytes: &[u8]) -> u64 {
self.write(bytes);
self.finish()
}
}
pub trait Hash {
...
/// Hashes only this value
fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
self.hash(state);
state.finish()
}
}
So far, so pointless. But back-compat! But now hashers can override write_only to something optimized, and Hash impls can specialize in the following way:
// #[derive]d impl
impl Hash {
fn hash() { ... } // same-old
#[if(only_has_one_field)] // magic
fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
self.only_field.hash_one_shot(state)
}
}
And key types like
fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
state.write_only(self.as_slice())
}
However we must be wary of #5257! In particular, slices and strs must mix in some "bonus" value so that This seems really sketchy, but I can't think of a situation where this would actually break things. You will get potentially curious behaviour, where I've been flying around the country and haven't slept though. |
Incidentally, this proposal soft-fixes #27108 (&[u8] and &str would indeed hash_one_shot to equal values). |
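The one-shot idea above can be tried out today without touching std. This is a rough sketch that uses an extension trait in place of a new method on `Hasher`. `OneShotHasher` and `ToyHasher` are hypothetical names, and the mixing function is a toy rather than a real hash. The short-input path folds the length into the word, which is one way to read the "bonus value" concern above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical extension trait standing in for the proposed `write_only`.
pub trait OneShotHasher: Hasher {
    /// Hash only `bytes` and finalize. The default just streams.
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        self.write(bytes);
        self.finish()
    }
}

// Existing hashers keep working through the default implementation.
impl OneShotHasher for DefaultHasher {}

/// Toy hasher (not a good hash) that overrides the one-shot path.
pub struct ToyHasher(u64);

impl ToyHasher {
    pub fn new() -> Self {
        ToyHasher(0)
    }
}

impl Hasher for ToyHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.rotate_left(5) ^ u64::from(b);
            self.0 = self.0.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

impl OneShotHasher for ToyHasher {
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        // Branch on the input size, which the streaming API cannot do.
        if bytes.len() < 8 {
            let mut word = [0u8; 8];
            word[..bytes.len()].copy_from_slice(bytes);
            // The top byte is free for inputs under 8 bytes, so store the
            // length there; [0] and [0, 0] then pack to different words.
            let packed = u64::from_le_bytes(word) | ((bytes.len() as u64) << 56);
            return packed.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
        // Longer inputs fall back to the streaming path.
        self.write(bytes);
        self.finish()
    }
}
```

As the proposal says, the result of `write_only` is only meaningful on a fresh hasher. The short path here ignores any previously written state.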
To me, the naming of |
@gankro I'm exploring a slightly different design here: https://github.com/ranma42/rust-hash/tree/master/src Apart from the naming, which can obviously be made more consistent with that currently used in Rust, the main difference is that the stream-oriented part of the code is aggressively inlined. Ideally I would want the implementation of one-shot digest to be automatically generated by the compiler, which should be able to get rid of the streaming overhead simply with constant folding and dead code elimination. This actually happens in the branch I posted, namely the The usual structure of hash functions makes me wonder whether there is much to gain by having |
An important point is to allow setting the seed for the HashState (or the equivalent). Some applications (like simulators) must have reproducible executions, so it's necessary to specify the same seed for the hasher to get the same iteration order on HashSet and HashMap. |
@malbarbo Yes, this is well-handled by the with_hash_state method. You make your own state with the desired seed, and then pass it to the HashMap. |
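A sketch of that answer, with assumptions stated up front: it uses `HashMap::with_hasher` and the `BuildHasher` trait, which are the names this thread later reports `with_hash_state` and `HashState` were renamed to. `SeededState` and `SeededHasher` are hypothetical, and the mixing is FNV-style for brevity. Identical iteration order for identical seeds and insertions is what current std does in practice, not something the docs promise.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical hasher whose initial state is the caller's seed.
pub struct SeededHasher(u64);

impl Hasher for SeededHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// The "state" handed to the map: nothing but a fixed seed.
#[derive(Clone, Copy)]
pub struct SeededState {
    pub seed: u64,
}

impl BuildHasher for SeededState {
    type Hasher = SeededHasher;

    fn build_hasher(&self) -> SeededHasher {
        SeededHasher(self.seed)
    }
}

/// Hash one key under a given seed.
pub fn hash_with_seed(seed: u64, key: &str) -> u64 {
    let mut h = SeededState { seed }.build_hasher();
    key.hash(&mut h);
    h.finish()
}

/// Build a map with a fixed seed and report the order its keys come out in.
pub fn keys_in_iteration_order(seed: u64) -> Vec<u32> {
    let mut map = HashMap::with_hasher(SeededState { seed });
    for k in 0..32u32 {
        map.insert(k, ());
    }
    map.keys().copied().collect()
}
```

Two runs of a simulator that both call `keys_in_iteration_order(7)` would see the keys in the same order, which is the reproducibility being asked for.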
@gankro the whole problem here is that |
Nominating for 1.6 discussion |
CC @shepmaster As one of the hash-thing maintainers, are you happy with this API? |
🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔 The libs team decided that this API is likely ready for stabilization pending a final audit of the ergonomics and usage. If anything arises it will be bumped out of FCP for a later cycle. |
From the point-of-view of a consumer of the API, I've found it to be pleasing enough to use, especially because of the
use std::collections::HashMap;
use twox_hash::RandomXxHashState;
let mut hash: HashMap<_, _, RandomXxHashState> = Default::default();
hash.insert(42, "the answer");
assert_eq!(hash.get(&42), Some(&"the answer"));
From an implementor point-of-view, I'll softly agree with @ranma42's point:
In my mind, the state of the hash is the internal bits-and-bobs that change every time you add more data to be hashed. In the usage of twox-hash, it's really more like the "state" is a seed and the "hasher" contains the real state. However, looking back from a user POV, neither "state" nor "seed" sound great when you write the declaration ( Naming is hard! Falling into Java naming land, the parameter to the From a pragmatic POV, there will probably never be more than ~100 implementations of hash algorithms, so it's a very small number of people that will have to struggle with this. That may mean it's not worth spending enormous amounts of time on this naming. There will be many more people who use a custom hash algorithm, so as long as the implementations have a nice bit of example of "use it like this", it will probably be just fine. |
Thanks for weighing in @shepmaster! I agree that the naming here perhaps isn't the best, although I am also at a bit of a loss of what would fit as a good name. For construction of a custom hash map, though, you can even use |
@alexcrichton I mentioned that context ( In the stub library I wrote, I named Would it still be possible to rename these traits (and maybe their methods)? Although the names I proposed may not be much easier to understand, they would align with the existing nomenclature. This would at least make them easier to use for people who are already using them in other languages (and it would provide nice consistency for anybody writing library bindings/wrappers). |
Hm yeah Regardless I think it's still possible to tweak some naming here. If larger changes happen, however, then I think we'll have to punt on this until another cycle. |
@alexcrichton Yes, in the In my stub library, the object hierarchy is designed so that the additional state (seeds, initialization vectors, number of rounds/passes, any parameters) of the hashing algorithm belongs to the object implementing the This has the convenient advantage that in most cases the initialisation data does not need to be stored in the context (for example, the I did some more work on this, but I did not clean it up and push it to the public repo, but since there seems to be some interest, I will try to get around to it this weekend. |
I guess I'd disagree. For example, I could see some code like:
trait BuildHasher {
type Hasher;
fn build_hasher(&self) -> Self::Hasher;
}
struct MyCoolHash {
seed: u32,
stream: u32,
}
struct MyCoolHashBuilder {
seed: u32,
stream: u32,
}
impl MyCoolHashBuilder {
fn new() -> Self {
MyCoolHashBuilder {
seed: 0, // make a random seed
stream: 1, // make a random stream
}
}
fn seed(self, seed: u32) -> Self {
MyCoolHashBuilder {
seed: seed,
..self
}
}
fn stream(self, stream: u32) -> Self {
MyCoolHashBuilder {
stream: stream,
..self
}
}
}
impl BuildHasher for MyCoolHashBuilder {
type Hasher = MyCoolHash;
fn build_hasher(&self) -> Self::Hasher {
MyCoolHash {
seed: self.seed,
stream: self.stream,
}
}
}
fn main() {
use std::collections::HashMap;
let x = MyCoolHashBuilder::new().seed(42).stream(1);
HashMap::with_hasher(x);
}
There are two parts to the builder pattern - the interesting and unique configuration and the actual building. The The biggest downside I see is that I'd expect that most hashing is going to have a small number of knobs to tweak, which means the number of configuration methods would be small (or zero!). |
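The sketch above declares its own `BuildHasher` trait and leaves `MyCoolHash` without a `Hasher` impl, so it reads as pseudocode. Below is one way it might look against the trait as it was eventually stabilized in `std::hash`. The `state` field and the mixing in `write` are invented here purely so the example compiles. They are not anything proposed in the thread.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

/// The per-key hasher; `state` is an assumption made for this sketch.
pub struct MyCoolHash {
    state: u64,
}

impl Hasher for MyCoolHash {
    fn write(&mut self, bytes: &[u8]) {
        // Placeholder mixing, for illustration only.
        for &b in bytes {
            self.state = (self.state ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Configuration half of the builder pattern.
#[derive(Clone, Copy)]
pub struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    pub fn new() -> Self {
        // Fixed values here; a real builder might pick random ones.
        MyCoolHashBuilder { seed: 0, stream: 1 }
    }

    pub fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder { seed, ..self }
    }

    pub fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder { stream, ..self }
    }
}

/// Building half: the std trait is the "final step".
impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> MyCoolHash {
        MyCoolHash {
            state: (u64::from(self.stream) << 32) | u64::from(self.seed),
        }
    }
}

/// Configure a builder, hand it to the map, and use the map as usual.
pub fn demo_map() -> HashMap<u32, &'static str, MyCoolHashBuilder> {
    let builder = MyCoolHashBuilder::new().seed(42).stream(1);
    let mut map = HashMap::with_hasher(builder);
    map.insert(42, "the answer");
    map
}
```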
Oh interesting! That's actually a good point that the trait could be considered the "final step" rather than the configuration up front, and along those lines I'd be pretty cool with |
@alexcrichton I'd still argue for the proposed |
I suppose I personally prefer the name "hash builder" or We don't have that many instances of "make" in libstd I think, the ones I could find are:
I feel that "build", however, at least does show up throughout libstd |
I agree that (That said, I won't stand strongly in the way of |
We're avoiding factory for no reason at all, that's fine, builder can be our own factory. I think BuildHasher is better than MakeHasher. However, no reason to make a builder style API for it just because of the name. Utility first. |
🔔 This issue is now entering its final comment period for stabilization 🔔 The final piece we believe warrants discussion is the naming of the trait itself, of which the two leading candidates seem to be |
Is it too late to submit Womb as Rust's version of a Factory? I think this would definitely resolve any and all confusion with regards to the semantics. Definitely. |
I feel like |
I think the name |
I think it does have to be random forever, at least to some extent. Changing the default would undermine the safety of applications that use HashMap. If you accept that, guaranteeing it in the name seems kind of reasonable, as a warning. |
Also: I maintain that |
The libs team discussed this in triage recently and the decision was to stabilize with the names |
@alexcrichton I had two reasons for preferring Then again, I agree that |
This commit implements the stabilization of the custom hasher support intended for 1.7 but left out due to some last-minute questions that needed some decisions. A summary of the actions done in this PR are:
Stable
* `std::hash::BuildHasher`
* `BuildHasher::Hasher`
* `BuildHasher::build_hasher`
* `std::hash::BuildHasherDefault`
* `HashMap::with_hasher`
* `HashMap::with_capacity_and_hasher`
* `HashSet::with_hasher`
* `HashSet::with_capacity_and_hasher`
* `std::collections::hash_map::RandomState`
* `RandomState::new`
Deprecated
* `std::collections::hash_state`
* `std::collections::hash_state::HashState` - this trait was also moved into `std::hash` with a reexport here to ensure that we can have a blanket impl to prevent immediate breakage on nightly. Note that this is unstable in both location.
* `HashMap::with_hash_state` - renamed
* `HashMap::with_capacity_and_hash_state` - renamed
* `HashSet::with_hash_state` - renamed
* `HashSet::with_capacity_and_hash_state` - renamed
Closes rust-lang#27713
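A short tour of the stabilized items listed in that commit message, as a sketch against today's std. `Deterministic` and `stabilized_api_tour` are names invented for this example. `DefaultHasher` is used only as a convenient `Hasher + Default` type and is not part of the list above.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault;

/// Any `Hasher + Default` type becomes a `BuildHasher` through this adapter.
pub type Deterministic = BuildHasherDefault<DefaultHasher>;

/// Exercise each stabilized constructor once and return a few sizes.
pub fn stabilized_api_tour() -> (usize, usize, bool) {
    // HashMap::with_hasher
    let mut a: HashMap<&str, u32, Deterministic> =
        HashMap::with_hasher(Deterministic::default());
    a.insert("one", 1);

    // HashMap::with_capacity_and_hasher with an explicit RandomState::new
    let mut b: HashMap<&str, u32, RandomState> =
        HashMap::with_capacity_and_hasher(16, RandomState::new());
    b.insert("two", 2);

    // HashSet::with_hasher and HashSet::with_capacity_and_hasher
    let mut s: HashSet<u32, Deterministic> =
        HashSet::with_hasher(Deterministic::default());
    s.insert(3);
    let t: HashSet<u32, Deterministic> =
        HashSet::with_capacity_and_hasher(8, Deterministic::default());

    // Default::default() works for any hasher builder that is Default.
    let d: HashMap<u32, u32, Deterministic> = Default::default();

    (a.len() + b.len(), s.len() + t.len(), d.is_empty())
}
```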
This is a tracking issue for the unstable hashmap_hasher feature in the standard library. This provides the ability to create a HashMap with a custom hashing implementation that abides by the HashState trait. This has already been used quite a bit in the compiler itself as well as in Servo I believe.
Some notable points to consider:
* HashState trait really necessary?
* HashState correct?
* Default appropriate here?
* new constructor be leveraged to create hash maps that use a hasher implementing Default? Right now the new constructor only works with RandomState.
* Hasher implementation? In theory it should be quite ergonomic.
cc @gankro