feat!: use stable hash from rustc-stable-hash #14116

Closed
wants to merge 2 commits into from

Conversation

weihanglo
Member

@weihanglo weihanglo commented Jun 20, 2024

What does this PR try to resolve?

This helps -Ztrim-paths build a stable cross-platform path for the
registry and git sources. Source files can then be found at the same
path when debugging.

See #13171 (comment)

How should we test and review this PR?

There are a few caveats, and we should do an FCP before merge:

  • This will invalidate the currently downloaded caches.
    This needs to be noted in the Cargo CHANGELOG.
  • As a consequence of changing how SourceId is hashed, the global cache
    tracker is also affected, because Cargo writes source identifiers (e.g.
    index.crates.io-6f17d22bba15001f) to SQLite.
  • rustc-stable-hash is slightly slower than the old SipHasher in std on
    short inputs like SourceId, but faster on long inputs like fingerprints.
    See Additional information.
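
To make the second caveat concrete, here is a minimal sketch of how an identifier shaped like index.crates.io-6f17d22bba15001f could be derived from a source URL. FNV-1a is used only as a stand-in because its output depends on nothing but the input bytes. It is not the algorithm Cargo uses, `source_ident` is a hypothetical helper, and the real hash input is more than the bare URL, so the suffix printed here will not match Cargo's.

```rust
/// FNV-1a (64-bit): the result depends only on the input bytes, so it is the
/// same on every platform. Stand-in for illustration, NOT Cargo's algorithm.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes.iter().fold(OFFSET_BASIS, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(PRIME)
    })
}

/// Hypothetical helper: builds a `<host>-<16 hex digits>` identifier.
fn source_ident(host: &str, url: &str) -> String {
    format!("{host}-{:016x}", fnv1a_64(url.as_bytes()))
}
```

Calling `source_ident("index.crates.io", "registry+https://github.com/rust-lang/crates.io-index")` yields the host followed by a 16-digit hex suffix. Any change to the hash function changes that suffix, which is why existing cache directories and the identifiers stored in SQLite would no longer match.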

StableHasher is used in several places. We should consider whether a cryptographic hash is needed (see #13171 (comment)).

Additional information

Benchmark on x86_64-unknown-linux-gnu

bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmark on aarch64-apple-darwin

bench_hasher/RustcStableHasher/URL
                        time:   [19.619 ns 19.645 ns 19.670 ns]
Found 156 outliers among 1000 measurements (15.60%)
  10 (1.00%) low severe
  59 (5.90%) low mild
  43 (4.30%) high mild
  44 (4.40%) high severe
bench_hasher/SipHasher/URL
                        time:   [17.809 ns 17.826 ns 17.843 ns]
Found 34 outliers among 1000 measurements (3.40%)
  28 (2.80%) high mild
  6 (0.60%) high severe
bench_hasher/RustcStableHasher/300 chars
                        time:   [95.535 ns 95.679 ns 95.824 ns]
Found 48 outliers among 1000 measurements (4.80%)
  39 (3.90%) high mild
  9 (0.90%) high severe
bench_hasher/SipHasher/300 chars
                        time:   [151.18 ns 151.37 ns 151.58 ns]
Found 16 outliers among 1000 measurements (1.60%)
  13 (1.30%) high mild
  3 (0.30%) high severe
bench_hasher/RustcStableHasher/lorem ipsum (3222 chars)
                        time:   [975.85 ns 976.65 ns 977.50 ns]
Found 92 outliers among 1000 measurements (9.20%)
  48 (4.80%) high mild
  44 (4.40%) high severe
bench_hasher/SipHasher/lorem ipsum (3222 chars)
                        time:   [1.7856 µs 1.7872 µs 1.7888 µs]
Found 66 outliers among 1000 measurements (6.60%)
  47 (4.70%) high mild
  19 (1.90%) high severe
Criterion benchmark script

#![allow(deprecated)]

use std::hash::Hash as _;
use std::hash::Hasher as _;

use criterion::criterion_group;
use criterion::criterion_main;
use criterion::BenchmarkId;
use criterion::Criterion;

struct SipHasher(std::hash::SipHasher);

impl SipHasher {
    fn new() -> SipHasher {
        SipHasher(std::hash::SipHasher::new())
    }
}

impl std::hash::Hasher for SipHasher {
    fn finish(&self) -> u64 {
        self.0.finish()
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }
}

struct RustcStableHasher(rustc_stable_hash::StableHasher);

impl RustcStableHasher {
    fn new() -> RustcStableHasher {
        RustcStableHasher(rustc_stable_hash::StableHasher::new())
    }

    fn finish(self) -> u64 {
        self.0.finalize().0
    }
}

impl std::hash::Hasher for RustcStableHasher {
    fn finish(&self) -> u64 {
        panic!("call StableHasher::finish instead");
    }

    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }
}

const INPUTS: &[(&'static str, &'static str)] = &[
    ("URL", "registry+https://github.com/rust-lang/crates.io-index"),
    ("300 chars", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sem odio, consectetur ac velit ac, hendrerit pulvinar nisl. Aenean auctor felis non accumsan porta. Nullam purus diam, aliquam nec dui vitae, iaculis fermentum eros. Nunc laoreet lectus nec malesuada tristique. Quisque venenatis vehicula"),
    ("lorem ipsum (3222 chars)", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec sem odio, consectetur ac velit ac, hendrerit pulvinar nisl. Aenean auctor felis non accumsan porta. Nullam purus diam, aliquam nec dui vitae, iaculis fermentum eros. Nunc laoreet lectus nec malesuada tristique. Quisque venenatis vehicula lacus sed auctor. In libero sapien, auctor vulputate tellus ut, scelerisque feugiat neque. Sed feugiat nulla vel lorem tincidunt viverra. Proin blandit pretium sapien id imperdiet. Sed elementum, ligula quis porttitor consectetur, augue ligula consectetur erat, at congue massa tortor in odio. Morbi sit amet tincidunt libero, eu rutrum felis. Integer rhoncus tortor et erat congue venenatis. Proin ac ante sit amet urna tincidunt ullamcorper. Vestibulum nec tincidunt neque. Vestibulum venenatis, libero et blandit pretium, risus nibh efficitur libero, vel condimentum tortor nulla non sapien. Morbi ac dapibus est. Duis justo arcu, laoreet lacinia luctus mollis, placerat non augue. Interdum et malesuada fames ac ante ipsum primis in faucibus. Fusce vestibulum eu tellus in pellentesque. Nam efficitur mattis turpis. Vestibulum a condimentum purus. Suspendisse eget augue scelerisque sem dignissim ornare vitae in augue. Vestibulum porta rhoncus sapien, non luctus nisi vehicula in. Etiam cursus tortor turpis, eu imperdiet purus facilisis ut. Nullam vestibulum erat ex, sit amet commodo est fermentum eleifend. Donec pulvinar imperdiet urna, egestas ultricies mi pulvinar at. Maecenas velit dui, iaculis at egestas eu, consequat sit amet nisl. Ut eu leo ultricies, porttitor ante eu, ultrices massa. Nam commodo, nunc ut mollis egestas, lectus ex eleifend nisl, vitae mollis metus quam vitae sapien. Curabitur eu nulla massa. Vivamus sodales turpis et lorem placerat, ac dignissim nulla luctus. In placerat eleifend orci, dapibus varius felis tincidunt sed. Nulla suscipit mauris condimentum ipsum finibus, ac mattis sapien aliquet. 
Cras feugiat elementum augue, viverra lacinia ante congue et. Sed et bibendum sem. Aenean pretium tellus eget velit commodo pretium a sit amet velit. Curabitur vitae est vitae nulla venenatis tristique in a eros. In scelerisque lectus et luctus mattis. Cras ac purus ac purus tempor molestie vitae vitae felis. Quisque volutpat elementum felis vitae mollis. Pellentesque finibus quam eget vestibulum tempus. Praesent quis massa eget ligula ultrices lobortis. Ut pellentesque, mi ac finibus sagittis, dui felis tempor dui, ac commodo mauris massa nec dolor. Cras congue, lectus vitae luctus faucibus, massa mauris malesuada elit, et facilisis turpis odio non justo. Proin volutpat turpis quis ante interdum pellentesque. Morbi faucibus, erat vel elementum aliquet, odio leo eleifend magna, sagittis semper lorem mauris nec arcu. Curabitur lacinia sagittis ante mollis facilisis. Fusce ultrices tellus sed justo rhoncus varius ut eu justo. Sed a est purus. Sed nec mi laoreet, consequat justo nec, sodales augue. Nullam posuere ipsum et velit aliquam blandit a quis metus. Aliquam id eros non magna suscipit bibendum. Curabitur porta auctor sapien, a molestie nisl. Donec neque leo, consequat vitae velit sit amet, aliquam elementum purus. Donec sit amet congue mi. Etiam at magna nunc."),
];

fn bench_hasher(c: &mut Criterion) {
    let mut group = c.benchmark_group("bench_hasher");
    group.sample_size(1000);
    for (name, input) in INPUTS {
        let id = BenchmarkId::new("RustcStableHasher", name);
        group.bench_with_input(id, input, |b, input| {
            b.iter(|| {
                let mut hasher = RustcStableHasher::new();
                input.hash(&mut hasher);
                hasher.finish();
            })
        });
        let id = BenchmarkId::new("SipHasher", name);
        group.bench_with_input(id, input, |b, input| {
            b.iter(|| {
                let mut hasher = SipHasher::new();
                input.hash(&mut hasher);
                hasher.finish();
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_hasher);
criterion_main!(benches);
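
One caution when reading the numbers above: the 33.8 ps reported for RustcStableHasher/URL on x86_64 is shorter than a single CPU cycle, which suggests the compiler may have optimized the hashing away in that run. In the script, each closure ends with `hasher.finish();`, so the hash value is discarded. A minimal sketch of guarding against this with `std::hint::black_box` follows. It uses only std, and `DefaultHasher` is a stand-in for the hashers in the script (an assumption made to keep the example self-contained); `time_hash` is a hypothetical helper, not part of the benchmark.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hashes `input` `iters` times and returns the elapsed time and the last hash.
/// `DefaultHasher` is a stand-in for the hashers benchmarked in the script.
fn time_hash(input: &str, iters: u32) -> (Duration, u64) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        let mut hasher = DefaultHasher::new();
        // Hiding the input keeps the compiler from precomputing the hash.
        black_box(input).hash(&mut hasher);
        // Hiding the result keeps the compiler from dropping the computation.
        last = black_box(hasher.finish());
    }
    (start.elapsed(), last)
}
```

In a Criterion benchmark, the equivalent would likely be to return the hash value from the closure passed to `b.iter` instead of discarding it.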

@rustbot
Collaborator

rustbot commented Jun 20, 2024

r? @epage

rustbot has assigned @epage.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the A-cache-messages (Area: caching of compiler messages), A-layout (Area: target output directory layout, naming, and organization), A-registries (Area: registries), and S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) labels Jun 20, 2024
@weihanglo weihanglo changed the title from "Stable hash" to "feat!: use stable hash from rustc-stable-hash" Jun 20, 2024
@weihanglo
Member Author

This is blocked on releasing rustc-stable-hash to crates.io

@weihanglo weihanglo added the Z-trim-paths Nightly: path sanitization label Jun 20, 2024
@briansmith

From a <1 minute reading of "-Ztrim-paths build a stable cross-platform path for the registry and git sources."

My understanding is that the intent here is to use a hash function to create a stable path to a particular set of source files or artifacts. If there is a hash collision, then potentially hash(malicious-sources) == hash(trusted-sources), and so malicious-sources could silently be used instead of trusted-sources.
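
As a rough sense of scale for the accidental case only, the birthday bound can be sketched as below. This assumes an ideal hash with uniformly distributed output, and it says nothing about an attacker who deliberately searches for a collision, which is the scenario this comment is concerned with. The function name is hypothetical.

```rust
/// Approximate probability that at least two of `n` distinct inputs collide
/// under an ideal hash with `bits` bits of output (birthday bound):
/// p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    // -exp_m1(-x) is a numerically stable way to compute 1 - e^(-x).
    -(-n * (n - 1.0) / (2.0 * space)).exp_m1()
}
```

For a 64-bit output the probability stays negligible for a few thousand inputs and approaches 40% at around 2^32 inputs.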

Comment on lines 22 to 24
fn write(&mut self, bytes: &[u8]) {
    self.0.write(bytes)
}
Member

@Urgau Urgau Jun 20, 2024

I'm not sure that forwarding only Hasher::write is enough. The endianness handling is done in the individual write_{u,i}{8,16,32,64,128} methods, so not forwarding those will bypass that handling [1].

I think it's also going to bypass the {u,i}size handling.

Footnotes

  1. The default implementations use native endianness instead of a fixed one.
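
A minimal sketch of the point above, using only std. `Recorder` and `FixedEndian` are hypothetical types for illustration and do not describe what rustc-stable-hash does internally. The wrapper overrides the integer methods so that the bytes reaching the inner hasher no longer depend on the host's endianness or pointer width. If only `write` were forwarded, the default `write_u32` and friends would pass native-endian bytes through.

```rust
use std::hash::Hasher;

/// Records every byte it is fed, so the forwarded bytes can be inspected.
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn finish(&self) -> u64 {
        0
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
}

/// Hypothetical wrapper: pins integer writes to little-endian and widens
/// `usize` to 64 bits before forwarding to the inner hasher.
struct FixedEndian<H>(H);

impl<H: Hasher> Hasher for FixedEndian<H> {
    fn finish(&self) -> u64 {
        self.0.finish()
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes)
    }
    fn write_u16(&mut self, i: u16) {
        self.0.write(&i.to_le_bytes())
    }
    fn write_u32(&mut self, i: u32) {
        self.0.write(&i.to_le_bytes())
    }
    fn write_u64(&mut self, i: u64) {
        self.0.write(&i.to_le_bytes())
    }
    fn write_u128(&mut self, i: u128) {
        self.0.write(&i.to_le_bytes())
    }
    fn write_usize(&mut self, i: usize) {
        // Widen first so 32-bit and 64-bit targets feed the same 8 bytes.
        self.0.write(&(i as u64).to_le_bytes())
    }
}
```

The signed variants need no overrides here because their default implementations delegate to the unsigned ones.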

Member Author

After rust-lang/rustc-stable-hash#6 and rust-lang/rustc-stable-hash#8, I think StableSipHasher128 should be a drop-in replacement that we can use directly.

I am also thinking of blake3 as an alternative, but haven't figured out how to make it play nice with ExtendedHasher.

This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Source files can then be found at the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequence of changing how `SourceId` is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* rustc-stable-hash is slightly slower than the old SipHasher in std on
  short inputs like `SourceId`, but faster on long inputs like
  fingerprints. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This distinguishes otherwise identical units that have different feature sets between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rustc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.
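
As an illustration of the file-content use case in the list above, here is a minimal sketch of hashing a reader in chunks so that two snapshots of the same file can be compared. `hash_reader` is a hypothetical helper, and `DefaultHasher` is a stand-in for Cargo's StableHasher (an assumption made to keep the example std-only). Its output is not guaranteed to be stable across Rust releases, which is acceptable only when both hashes are computed within the same process.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

/// Feeds everything from `reader` into a hasher and returns the 64-bit hash.
/// `DefaultHasher` is a stand-in here, not the hasher Cargo uses.
fn hash_reader<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}
```

Comparing the value returned before and after a build gives a cheap signal that a file's contents changed in between.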

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
@rustbot rustbot added the A-rebuild-detection Area: rebuild detection and fingerprinting label Jul 9, 2024
// The hash value depends on endianness and bit-width, so we only run this test on
// little-endian 64-bit CPUs (such as x86-64 and ARM64) where it matches the
// well-known value.
// The hash value should be stable across platforms, and doesn't depend on
Member Author

This is something we need to fix if the goal is to be fully cross-platform.

@Urgau
Member

Urgau commented Jul 11, 2024

For information, rustc-stable-hash v0.1.0 has now been released on crates.io!

@weihanglo
Member Author

Status update:

I had some chats with Urgau. Here is benchmark data for blake3/blake2s in rustc.

Hashing is not a dominant performance factor in Cargo, but it still plays a role. From my reading of the source code, a Unit of work in Cargo (usually bound to a rustc invocation) may involve roughly 10-20 hash computations, so the hash function may still affect build time.

We will need more real-world benchmarks, for example cargo check on a cached build.

@bors
Contributor

bors commented Aug 1, 2024

☔ The latest upstream changes (presumably #14334) made this pull request unmergeable. Please resolve the merge conflicts.

@weihanglo weihanglo closed this Oct 9, 2024
@weihanglo weihanglo deleted the stable-hash branch October 9, 2024 00:21
@weihanglo weihanglo restored the stable-hash branch October 23, 2024 02:01
Labels
A-cache-messages Area: caching of compiler messages A-layout Area: target output directory layout, naming, and organization A-rebuild-detection Area: rebuild detection and fingerprinting A-registries Area: registries S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. Z-trim-paths Nightly: path sanitization
6 participants