Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher" #93014

Merged

Conversation

Kobzol
Copy link
Contributor

@Kobzol Kobzol commented Jan 17, 2022

Reverts #92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see this comment).

I'll try to send the PR again with a fix for this issue.

r? @the8472

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 17, 2022
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 17, 2022
@the8472
Copy link
Member

the8472 commented Jan 17, 2022

Expecting negative perf impact since it reverts a perf optimization to restore correctness.

@bors r+ rollup=never

@bors
Copy link
Contributor

bors commented Jan 17, 2022

📌 Commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 has been approved by the8472

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 17, 2022
@the8472 the8472 added perf-regression-triaged The performance regression has been triaged. beta-nominated Nominated for backporting to the compiler in the beta channel. labels Jan 17, 2022
@the8472
Copy link
Member

the8472 commented Jan 17, 2022

For the beta backport consideration: Impact is probably low since we have had no issue reports so far that could be traced back to the correcness issue. On the other hand it's a revert so the risk should also be low.

@cuviper
Copy link
Member

cuviper commented Jan 19, 2022

Why would it be a correctness issue? Predictable hash collisions are a performance problem, possible DoS in user-facing contexts, but should otherwise still behave correctly.

@Kobzol
Copy link
Contributor Author

Kobzol commented Jan 19, 2022

AFAIK, stable hashing has more strict requirements than normal Hash. Conflicts should happen only with the smallest possible probability. With this implementation, a source code change might produce the same hash and thus break incr. compilation. The chance is probably quite small (maybe even this specific situation can't happen in the current version of rustc, hard to tell), but it should be fixed I think.

@the8472
Copy link
Member

the8472 commented Jan 19, 2022

AFAIK, stable hashing has more strict requirements than normal Hash.

Indeed, that's part of the StableHasher contract.

/// - Stable hashes are sometimes used as identifiers. Therefore they must
/// conform to the corresponding `PartialEq` implementations:
///
/// - `x == y` implies `hash_stable(x) == hash_stable(y)`, and
/// - `x != y` implies `hash_stable(x) != hash_stable(y)`.

@cuviper
Copy link
Member

cuviper commented Jan 19, 2022

The negative case must still be subject to the pigeonhole principle though. You can try to improve that probability, but it can't be guaranteed. i.e. "x != y strongly implies hash_stable(x) != hash_stable(y)."

@the8472
Copy link
Member

the8472 commented Jan 19, 2022

The hashes are u128, we don't want to do significantly worse than random collisions. Different structs mapping to the same input to the hash function obviously are collisions, and an entire class thereof.

@cuviper
Copy link
Member

cuviper commented Jan 19, 2022

Yeah, agreed. I'm not arguing against the revert or the beta-backport, just noting that we can't be perfect.

@pnkfelix pnkfelix added the beta-accepted Accepted for backporting to the compiler in the beta channel. label Jan 20, 2022
@apiraino
Copy link
Contributor

Beta backport approved as per compiler team on Zulip

@rustbot label +beta-accepted

@pnkfelix
Copy link
Member

Yeah, agreed. I'm not arguing against the revert or the beta-backport, just noting that we can't be perfect.

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

@the8472
Copy link
Member

the8472 commented Jan 22, 2022

@pnkfelix

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

There shouldn't be any since - assuming a perfectly uniform hash function - it'd take 264 inputs to get a collision. other bugs and hardware glitches will get you long before that. For all practical purposes unique inputs -> unique outputs.

Or do you mean it should be documented what happens when the requirement about the inputs isn't upheld?

@bors
Copy link
Contributor

bors commented Jan 23, 2022

⌛ Testing commit cb225f2affc062daa2b05234eeeb152c3ea0ff97 with merge d9c1bf5eea3c3d6330796bd1f7bdc08347eb7760...

@bors
Copy link
Contributor

bors commented Jan 23, 2022

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jan 23, 2022
@rust-log-analyzer

This comment has been minimized.

@the8472
Copy link
Member

the8472 commented Jan 23, 2022

Tests need to be reblessed.

@the8472 the8472 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 23, 2022
@Kobzol Kobzol force-pushed the revert-92103-stable-hash-skip-zero-bytes branch from cb225f2 to 50f8062 Compare January 24, 2022 08:07
@Kobzol
Copy link
Contributor Author

Kobzol commented Jan 24, 2022

Done.

@the8472
Copy link
Member

the8472 commented Jan 24, 2022

@bors r+

@bors
Copy link
Contributor

bors commented Jan 24, 2022

📌 Commit 50f8062 has been approved by the8472

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jan 24, 2022
@bors
Copy link
Contributor

bors commented Jan 24, 2022

⌛ Testing commit 50f8062 with merge d2dc425...

@bors
Copy link
Contributor

bors commented Jan 24, 2022

☀️ Test successful - checks-actions
Approved by: the8472
Pushing d2dc425 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jan 24, 2022
@bors bors merged commit d2dc425 into rust-lang:master Jan 24, 2022
@rustbot rustbot added this to the 1.60.0 milestone Jan 24, 2022
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (d2dc425): comparison url.

Summary: This change led to very large relevant regressions 😿 in compiler performance.

  • Very large regression in instruction counts (up to 7.9% on incr-full builds of clap-rs check)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@rustbot rustbot added the perf-regression Performance regression. label Jan 24, 2022
@Kobzol Kobzol deleted the revert-92103-stable-hash-skip-zero-bytes branch January 24, 2022 16:06
@cuviper
Copy link
Member

cuviper commented Jan 24, 2022

@rustbot label: +perf-regression-triaged

#93014 (comment)

Expecting negative perf impact since it reverts a perf optimization to restore correctness.

ehuss pushed a commit to ehuss/rust that referenced this pull request Jan 25, 2022
…zero-bytes, r=the8472

Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher"

Reverts rust-lang#92103. It had a (in retrospect, obvious) correctness problem where changing the order of two adjacent values would produce identical hashes, which is problematic in stable hashing (see [this comment](rust-lang#92103 (comment))).

I'll try to send the PR again with a fix for this issue.

r? `@the8472`
@ehuss ehuss mentioned this pull request Jan 25, 2022
@ehuss ehuss removed the beta-nominated Nominated for backporting to the compiler in the beta channel. label Jan 25, 2022
@ehuss ehuss modified the milestones: 1.60.0, 1.59.0 Jan 25, 2022
@pnkfelix
Copy link
Member

@pnkfelix

I'd like us to be clearer in our docs about what the consequences are of imperfection here. I often see arguments of the form "the probability is so low, its never going to happen." And we treat hash collisions as serious bugs in our incr-comp infrastructure.

There shouldn't be any since - assuming a perfectly uniform hash function - it'd take 264 inputs to get a collision. other bugs and hardware glitches will get you long before that. For all practical purposes unique inputs -> unique outputs.

Or do you mean it should be documented what happens when the requirement about the inputs isn't upheld?

For anyone interested in where this went, this conversation largely moved over to Zulip (though I don't think it necessarily reached a conclusion): https://zulip-archive.rust-lang.org/stream/131828-t-compiler/topic/Ident.20and.20.60PartialEq.60.20.2F.20.60HashStable.60.html#269112656

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2022
[beta] Backports

* Update cargo
    * rust-lang/cargo#10328 — remove --host flag
    * rust-lang/cargo#10325 — Fix documenting with undocumented dependencies
* rust-lang#93128 — Add script to prevent point releases with same number as existing ones
* rust-lang#93014 — Revert "Do not hash leading zero bytes of i64 numbers in Sip128 hasher"
* rust-lang#93012 — Update pulldown-cmark version to fix markdown list issue
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 3, 2022
…sion, r=the8472

Compress amount of hashed bytes for `isize` values in StableHasher

This is another attempt to land rust-lang#92103, this time hopefully with a correct implementation w.r.t. stable hashing guarantees. The previous PR was [reverted](rust-lang#93014) because it could produce the [same hash](rust-lang#92103 (comment)) for different values even in quite simple situations. I have since added a basic [test](rust-lang#93193) that should guard against that situation, I also added a new test in this PR, specialised for this optimization.

## Why this optimization helps
Since the original PR, I have tried to analyze why this optimization even helps (and why it especially helps for `clap`). I found that the vast majority of stable-hashing `i64` actually comes from hashing `isize` (which is converted to `i64` in the stable hasher). I only found a single place where is this datatype used directly in the compiler, and this place has also been showing up in traces that I used to find out when is `isize` being hashed. This place is `rustc_span::FileName::DocTest`, however, I suppose that isizes also come from other places, but they might not be so easy to find (there were some other entries in the trace). `clap` hashes about 8.5 million `isize`s, and all of them fit into a single byte, which is why this optimization has helped it [quite a lot](rust-lang#92103 (comment)).

Now, I'm not sure if special casing `isize` is the correct solution here, maybe something could be done with that `isize` inside `DocTest` or in other places, but that's for another discussion I suppose. In this PR, instead of hardcoding a special case inside `SipHasher128`, I instead put it into `StableHasher`, and only used it for `isize` (I tested that for `i64` it doesn't help, or at least not for `clap` and other few benchmarks that I was testing).

## New approach
Since the most common case is a single byte, I added a fast path for hashing `isize` values which positive value fits within a single byte, and a cold path for the rest of the values.

To avoid the previous correctness problem, we need to make sure that each unique `isize` value will produce a unique hash stream to the hasher. By hash stream I mean a sequence of bytes that will be hashed (a different sequence should produce a different hash, but that is of course not guaranteed).

We have to distinguish different values that produce the same bit pattern when we combine them. For example, if we just simply skipped the leading zero bytes for values that fit within a single byte, `(0xFF, 0xFFFFFFFFFFFFFFFF)` and `(0xFFFFFFFFFFFFFFFF, 0xFF)` would send the same hash stream to the hasher, which must not happen.

To avoid this situation, values `[0, 0xFE]` are hashed as a single byte. When we hash a larger (treating `isize` as `u64`) value, we first hash an additional byte `0xFF`. Since `0xFF` cannot occur when we apply the single byte optimization, we guarantee that the hash streams will be unique when hashing two values `(a, b)` and `(b, a)` if `a != b`:
1) When both `a` and `b` are within `[0, 0xFE]`, their hash streams will be different.
2) When neither `a` and `b` are within `[0, 0xFE]`, their hash streams will be different.
3) When `a` is within `[0, 0xFE]` and `b` isn't, when we hash `(a, b)`, the hash stream will definitely not begin with `0xFF`. When we hash `(b, a)`, the hash stream will definitely begin with `0xFF`. Therefore the hash streams will be different.

r? `@the8472`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beta-accepted Accepted for backporting to the compiler in the beta channel. merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.