-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a dedicated length-prefixing method to Hasher
#94598
Conversation
This seems like a bad idea. It's getting better performance by skipping over some relevant bits of information. By the same reasoning, you might skip the first element when hashing an It's not just about DoS resistance, data such as |
No, I wouldn't say that. Rather the point is to hash enough of the entropy to get the collision rate acceptably low. It's sometimes fine to skip lengths the same way it's sometimes fine to skip big-but-low-cardinality fields in rustc has found that a lower-quality but faster hash is worth it, and it's entirely possible that the tradeoff in skipping lengths would also be a net win. Something like And because of It can potentially also provide value even if nobody ever omits the length entirely. For example, a byte-oriented hasher might choose to LEB128 the length prefixes to save hash calls in common cases, but not do that for all |
Some good points, but I want to address a few of them:
This comparison seems to be about performance of SipHash vs another hash. SipHash is just slow. I agree there are better hashers for many (most) applications, but you can get much faster hashers that don't skip anything.
Agreed, but this seems like a fundamental design choice / design issue with hashing arrays and Borrow, and this proposed API works around it in a brittle way which only works for certain hashers, rather than really addressing the fundamental problem with how constant-sized arrays are hashed.
Not really -- Strings are not prefix free. Good hashers can currently assume that the data they are hashing is already prefix free, so they don't need to add a terminator themselves, they can just pad the data with zeroes to a block size without any delimiters. This wouldn't work without the extra byte. Currently
Another solution without having to complicate the API would be for |
I absolutely want to try it for But always doing it feels wrong. It's probably a win for anything that works byte-by-byte, but rustc's So the extra LEB128 work would be entirely wasteful overhead for it -- even if it's only a 1-byte Which sounds to me like all the more reason to leave it up to the There are middle-ground possibilities here too. If you're implementing a hasher that uses a 4-byte block, then for platforms where |
The extra work is just one comparison against 128, or a single bitand, with a very predictable branch (the branch can assume it's always small). Avoiding this comparison (only in some non-standard If the length is 128 or more, the overhead for length doesn't matter because hashing 128 things is going to dominate (except for ZSTs). Alternatively, the following encoding could be even better: one byte for lengths smaller than 255, 9 bytes (255u8, length) for lengths 255 or more. |
☔ The latest upstream changes (presumably #95552) made this pull request unmergeable. Please resolve the merge conflicts. |
I'm removing T-compiler from the teams involved for review since it looks like a T-libs thing. In case feel free to add the label again. @rustbot label -T-compiler |
86aa700
to
ae6baf0
Compare
Rebased. @joshtriplett friendly 1-month review ping. Let me know if you'd rather I bring it up in a libs-api meeting. |
cc @Amanieu, who was mentioned in the libs-api meeting today as perhaps interested in commenting on this. |
r? @Amanieu |
@bors r+ rollup=never |
📌 Commit ae6baf0e0ba28a27734dc24786a29083015e9f33 has been approved by |
⌛ Testing commit ae6baf0e0ba28a27734dc24786a29083015e9f33 with merge 5036d0075d5da4806f69242f88e0241494512937... |
This comment has been minimized.
This comment has been minimized.
💔 Test failed - checks-actions |
ae6baf0
to
24e5585
Compare
…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
609977f
to
6b386f5
Compare
This comment has been minimized.
This comment has been minimized.
This accomplishes two main goals: - Make it clear who is responsible for prefix-freedom, including how they should do it - Make it feasible for a `Hasher` that *doesn't* care about Hash-DoS resistance to get better performance by not hashing lengths This does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.
We might want to change the default before stabilizing (or maybe even after), but for getting in the new unstable methods, leave it as-is for now. That way it won't break cargo and such.
6b386f5
to
ebdcb08
Compare
@bors r+ |
📌 Commit ebdcb08 has been approved by |
☀️ Test successful - checks-actions |
Finished benchmarking commit (8c4fc9d): comparison url. Summary:
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression Footnotes |
…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
…errors Put the incompatible_closure_captures lint messages in alphabetical order Looks like they were in hash order before, which was causing me trouble in rust-lang#94598, so this PR sorts the errors by trait name.
…anieu Further elaborate the lack of guarantees from `Hasher` I realized that I got too excited in rust-lang#94598 by adding new methods, and forgot to do the documentation to really answer the core question in rust-lang#94026. This PR just has that doc update. r? `@Amanieu`
/// writing its contents to this `Hasher`. That way | ||
/// `(collection![1, 2, 3], collection![4, 5])` and | ||
/// `(collection![1, 2], collection![3, 4, 5])` will provide different | ||
/// sequences of values to the `Hasher` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lost a full-stop .
This accomplishes two main goals:
Hasher
that doesn't care about Hash-DoS resistance to get better performance by not hashing lengthsThis does not change rustc-hash, since that's in an external crate, but that could potentially use it in future.
Fixes #94026
r? rust-lang/libs
The core of this change is the following two new methods on
Hasher
:With updates to the
Hash
implementations for slices and containers to callwrite_length_prefix
instead ofwrite_usize
.write_str
defaults to usingwrite_length_prefix
since, as was pointed out in the issue, thewrite_u8(0xFF)
approach is insufficient for hashers that work in chunks, as those would hash"a\u{0}"
and"a"
to the same thing. But sinceSipHash
works byte-wise (there's an internal buffer to accumulate bytes until a full chunk is available) it overrideswrite_str
to continue to use the add-non-UTF-8-byte approach.Compatibility:
Because the default implementation of
write_length_prefix
callswrite_usize
, the changed hash implementation for slices will do the same thing the old one did on existingHasher
s.