Remove zerocopy from rand #1579

Open: 6 commits to merge into master

dhardy (Member) commented Feb 6, 2025

  • Added a CHANGELOG.md entry

Summary

Replace the zerocopy dependency with unsafe code (raising the number of unsafe usages from 12 to 17).

Add benchmarks for some SIMD / wide types.

Remove two #[inline(never)] attributes which were apparently motivated by benchmark results but did more harm than good in the new benches.

Motivation

I'm not a big fan of this, but together with #1575 it removes the dependency on zerocopy v0.8, so it is probably an improvement.

Project Safe Transmute

If this project lands safe transmute support into the standard library, we would of course want to use that.

Details

Replacing zerocopy::transmute! with core::mem::transmute is easy and results in identical code generation (tested with StdRng and SmallRng); this reverts a change in #1349.
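
For readers unfamiliar with the trade-off, here is a minimal sketch of what such a replacement looks like. The helper name `bytes_to_u32x4` and the chosen types are illustrative assumptions, not code from this PR; the point is only that the soundness argument, which zerocopy checks through its traits, now lives in a SAFETY comment.

```rust
/// Hypothetical helper (not from the rand source): reinterpret 16 bytes
/// as four native-endian u32 words.
fn bytes_to_u32x4(bytes: [u8; 16]) -> [u32; 4] {
    // SAFETY: both types are exactly 16 bytes, neither has padding, and
    // every bit pattern is a valid [u32; 4]. transmute moves the value,
    // so the stricter alignment of the target type is not a concern.
    unsafe { core::mem::transmute::<[u8; 16], [u32; 4]>(bytes) }
}
```

The compiler still rejects a size mismatch between the two types, but it does not check bit-validity, so the comment has to carry that part of the argument.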

Replacing the fill impls is more complex but I believe acceptable; this reverts a change in #1502.

In both cases, this would have resulted in a usage of unsafe in a macro where safety depends on a type passed by the macro caller. In the first case I decided to inline the three macro usages while in the second I prefixed the macro name with unsafe_.
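
The naming convention can be sketched as follows. This is an illustration under assumptions: the trait `FillFrom`, the macro `unsafe_impl_fill` and the closure-based byte source are hypothetical names invented here, and the actual fill impls in rand are structured differently. The `unsafe_` prefix signals that the macro caller, not the macro body, is responsible for only passing types for which the byte view is sound.

```rust
/// Hypothetical trait: fill a slice from a byte-oriented source.
trait FillFrom {
    fn fill_from(&mut self, fill_bytes: &mut dyn FnMut(&mut [u8]));
}

/// The `unsafe_` prefix marks a caller obligation: every type passed in
/// must have no padding and must accept any bit pattern.
macro_rules! unsafe_impl_fill {
    ($($t:ty),*) => {$(
        impl FillFrom for [$t] {
            fn fill_from(&mut self, fill_bytes: &mut dyn FnMut(&mut [u8])) {
                // Length of the slice in bytes.
                let len = core::mem::size_of_val(&*self);
                // SAFETY: the pointer comes from an exclusive borrow that
                // covers `len` bytes, u8 has alignment 1, and the macro
                // caller promises that any bytes form valid values.
                let bytes = unsafe {
                    core::slice::from_raw_parts_mut(self.as_mut_ptr().cast::<u8>(), len)
                };
                fill_bytes(bytes);
            }
        }
    )*};
}

// Plain integer types satisfy the obligation stated above.
unsafe_impl_fill!(u16, u32, u64);
```

Invoking the macro with a type such as `bool` or `char` would compile but be unsound, which is exactly the hazard the prefix is meant to advertise.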

Benchmark results

$ cargo bench --bench simd --features simd_support -- --baseline master 
   Compiling rand v0.9.0 (/home/dhardy/projects/rand/rand)
   Compiling rand_distr v0.5.0 (/home/dhardy/projects/rand/rand/rand_distr)
   Compiling benches v0.1.0 (/home/dhardy/projects/rand/rand/benches)
    Finished `bench` profile [optimized] target(s) in 1.38s
     Running benches/simd.rs (target/release/deps/simd-2905efe84e67fa8e)
random_simd/u128        time:   [1.8751 ns 1.8831 ns 1.8948 ns]
                        change: [-0.1321% +0.6261% +1.4131%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
random_simd/m128i       time:   [1.9753 ns 1.9790 ns 1.9833 ns]
                        change: [+5.4631% +5.6551% +5.8561%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe
random_simd/m256i       time:   [3.7588 ns 3.7755 ns 3.7931 ns]
                        change: [-0.0698% +0.3828% +0.7685%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe
random_simd/m512i       time:   [6.8739 ns 6.8901 ns 6.9097 ns]
                        change: [+0.1511% +0.3741% +0.6309%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe
random_simd/u64x2       time:   [1.9767 ns 1.9817 ns 1.9875 ns]
                        change: [-72.129% -72.012% -71.890%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
random_simd/u32x4       time:   [3.9506 ns 3.9572 ns 3.9651 ns]
                        change: [-50.352% -50.035% -49.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  9 (9.00%) high severe
random_simd/u32x8       time:   [3.7498 ns 3.7598 ns 3.7717 ns]
                        change: [-0.0915% +0.3002% +0.8262%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe
random_simd/u16x8       time:   [3.7647 ns 3.7792 ns 3.7953 ns]
                        change: [-0.0710% +0.6785% +1.3454%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
random_simd/u8x16       time:   [3.7806 ns 3.7950 ns 3.8118 ns]
                        change: [+1.1070% +1.5527% +2.1092%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

Unfinished business?

Generation of the Simd types and of m128i etc. should be equivalent, but the code paths differ: the Simd impls currently use fill to avoid adding more unsafe code here.

Notice from the above that u32x4, u16x8 and u8x16 are the same size as u128 and m128i but cost about twice as much to generate here. This indicates the fill code may be sub-optimal.

Additionally, the m128i impl performed even worse when transmuting a u128 value (~4.3 ns, or +130%) which, as far as I can tell, is purely because the u128 value is returned via rax, rdx while the __m128i value is returned via rdx, r10 (with rax equal to the struct address). I don't understand this.

Results show that some Simd types are 2-4 times as expensive as expected
Results in a few minor regressions and two large improvements
in benchmarks: -72% time for u64x2, -50% for u32x4.
Code gen is identical and benchmarks unaffected.
…_parts_mut

Mostly code gen appears equivalent, though it affects
inlining of u32x4 gen with SmallRng.
Benchmarks are not significantly affected.
@dhardy dhardy requested a review from josephlr February 6, 2025 12:20
@mitsuhiko

> Replacing zerocopy::transmute! with core::mem::transmute is easy and results in identical code generation (tested with StdRng and SmallRng); this reverts a change in #1349.

For those cases where you just call zerocopy::transmute!, you could still use zerocopy in CI. You could declare an optional dependency on zerocopy and have a macro that switches between the zerocopy transmute for CI and tests and the stdlib one. That way you still get the verification that zerocopy provides, at least in CI.

I have been proposing this for ahash: tkaitchuck/aHash#253

I'm not sure if this is a great idea, but I think it's a compromise that has some value.
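
A rough sketch of the switching macro described above, assuming a hypothetical Cargo feature named `checked_transmute` that enables an optional zerocopy dependency. The macro name `transmute_pod` and the helper `to_words` are likewise made up for illustration and do not come from rand or aHash.

```rust
// With the (hypothetical) feature enabled, e.g. in CI, zerocopy verifies
// at compile time that the conversion is sound.
#[cfg(feature = "checked_transmute")]
macro_rules! transmute_pod {
    ($e:expr) => {
        zerocopy::transmute!($e)
    };
}

// Without the feature, fall back to the unchecked standard-library call.
#[cfg(not(feature = "checked_transmute"))]
macro_rules! transmute_pod {
    ($e:expr) => {
        // SAFETY: relies on every call site also compiling under the
        // checked variant above.
        unsafe { core::mem::transmute($e) }
    };
}

/// Example call site: the target type is inferred from the return type.
fn to_words(bytes: [u8; 8]) -> [u32; 2] {
    transmute_pod!(bytes)
}
```

The unchecked arm is only as trustworthy as the CI job that builds with the feature enabled, and it still places unsafe inside a macro whose soundness depends on the caller's types.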

joshlf (Contributor) commented Feb 12, 2025

> If this project lands safe transmute support into the standard library, we would of course want to use that.

I should clarify that Project Safe Transmute will likely never replace zerocopy/bytemuck, but just replace their derives (zerocopy-derive and bytemuck-derive). Some very limited functionality may exist directly in the standard library, but we think of Safe Transmute as mostly being a building block that makes it easier to write sound unsafe code, not a building block that permits you to avoid writing unsafe code entirely. I suspect this doesn't change the calculus here, but I figured it was worth mentioning.
