-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014
Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014
Conversation
Expecting negative perf impact since it reverts a perf optimization to restore correctness. @bors r+ rollup=never |
📌 Commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 has been approved by |
For the beta backport consideration: Impact is probably low since we have had no issue reports so far that could be traced back to the correcness issue. On the other hand it's a revert so the risk should also be low. |
Why would it be a correctness issue? Predictable hash collisions are a performance problem, possible DoS in user-facing contexts, but should otherwise still behave correctly. |
AFAIK, stable hashing has more strict requirements than normal Hash. Conflicts should happen only with the smallest possible probability. With this implementation, a source code change might produce the same hash and thus break incr. compilation. The chance is probably quite small (maybe even this specific situation can't happen in the current version of rustc, hard to tell), but it should be fixed I think. |
Indeed, that's part of the StableHasher contract. rust/compiler/rustc_data_structures/src/stable_hasher.rs Lines 150 to 154 in 5e57faa
|
The negative case must still be subject to the pigeonhole principle though. You can try to improve that probability, but it can't be guaranteed. i.e. " |
The hashes are |
Yeah, agreed. I'm not arguing against the revert or the beta-backport, just noting that we can't be perfect. |
I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure. |
There shouldn't be any since - assuming a perfectly uniform hash function - it'd take 264 inputs to get a collision. other bugs and hardware glitches will get you long before that. For all practical purposes unique inputs -> unique outputs. Or do you mean it should be documented what happens when the requirement about the inputs isn't upheld? |
⌛ Testing commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 with merge d9c1bf5eea3c3d6330796bd1f7bdc08347eb7760... |
💔 Test failed - checks-actions |
This comment has been minimized.
This comment has been minimized.
Tests need to be reblessed. |
cb225f2
to
50f8062
Compare
Done. |
@bors r+ |
📌 Commit 50f8062 has been approved by |
☀️ Test successful - checks-actions |
Finished benchmarking commit (d2dc425): comparison url. Summary: This change led to very large relevant regressions 😿 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression |
@rustbot label: +perf-regression-triaged
|
…zero-bytes, r=the8472 Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" Reverts rust-lang#92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see [this comment](rust-lang#92103 (comment))). I'll try to send the PR again with a fix for this issue. r? `@the8472`
For anyone interested in where this went, this conversation largely moved over to Zulip (though I don't think it necessarily reached a conclusion): https://zulip-archive.rust-lang.org/stream/131828-t-compiler/topic/Ident.20and.20.60PartialEq.60.20.2F.20.60HashStable.60.html#269112656 |
[beta] Backports * Update cargo * rust-lang/cargo#10328 — remove --host flag * rust-lang/cargo#10325 — Fix documenting with undocumented dependencies * rust-lang#93128 — Add script to prevent point releases with same number as existing ones * rust-lang#93014 — Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" * rust-lang#93012 — Update pulldown-cmark version to fix markdown list issue
…sion, r=the8472 Compress amount of hashed bytes for `isize` values in StableHasher This is another attempt to land rust-lang#92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was [reverted](rust-lang#93014) because it could produce the [same hash](rust-lang#92103 (comment)) for different values even in quite simple situations. I have since added a basic [test](rust-lang#93193) that should guard against that situation, I also added a new test in this PR, specialised for this optimization. ## Why this optimization helps Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashing `i64` actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where is this datatype used directly in the compiler, and this place has also been showing up in traces that I used to find out when is `isize` being hashed. This place is `rustc_span::FileName::DocTest`, however, I suppose that isizes also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it [quite a lot](rust-lang#92103 (comment)). Now, I'm not sure if special casing `isize` is the correct solution here, maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and other few benchmarks that I was testing). ## New approach Since the most common case is a single byte, I added a fast path for hashing `isize` values which positive value fits within a single byte, and a cold path for the rest of the values. To avoid the previous correctness problem, we need to make sure that each unique `isize` value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed). We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we just simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen. To avoid this situation, values `[0, 0xFE]` are hashed as a single byte. When we hash a larger (treating `isize` as `u64`) value, we first hash an additional byte `0xFF`. Since `0xFF` cannot occur when we apply the single byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`: 1) When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 2) When neither `a` and `b` are within `[0, 0xFE]`, their hash streams will be different. 3) When `a` is within `[0, 0xFE]` and `b` isn't, when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`. When we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different. r? `@the8472`
Reverts #92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see this comment).
I'll try to send the PR again with a fix for this issue.
r? @the8472