argon2: add parallelism #547

Open
wants to merge 6 commits into
base: master

@jonasmalacofilho jonasmalacofilho commented Jan 13, 2025

Adds a default-enabled parallel feature, with an otherwise optional dependency on rayon, and parallelizes the filling of blocks using the memory views described below.

Coordinated shared access in the memory blocks is implemented with a SegmentViewIter iterator, which implements either rayon::iter::ParallelIterator or core::iter::Iterator and returns SegmentView views into the Argon2 blocks memory that are safe to be used in parallel.

The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
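
The following is a minimal sketch of that idea, not the code in this PR. It assumes a simplified `SegmentView` over `u64` words instead of Argon2 blocks, uses `std::thread::scope` in place of rayon, and the helpers `segment_views` and `fill_slice` as well as the toy update rule are hypothetical. It only illustrates how raw pointers, `&mut self` borrowing and runtime assertions can combine so that references are created per word and never alias a written word.

```rust
use std::marker::PhantomData;
use std::thread;

/// Argon2 divides every lane into four slices (RFC 9106).
const SLICES: usize = 4;

/// Hypothetical, simplified view: one per lane, valid for a single slice.
pub struct SegmentView<'a> {
    base: *mut u64, // raw pointer: no reference to the whole buffer is kept
    len: usize,
    lane_len: usize,
    seg_len: usize,
    lane: usize,
    slice: usize,
    _borrow: PhantomData<&'a mut [u64]>,
}

// SAFETY: a view only hands out references to single words, and the checks
// in `get`/`get_mut` keep every written word disjoint from all other access.
unsafe impl Send for SegmentView<'_> {}

impl SegmentView<'_> {
    fn in_current_slice(&self, idx: usize) -> bool {
        (idx % self.lane_len) / self.seg_len == self.slice
    }

    fn owns(&self, idx: usize) -> bool {
        idx / self.lane_len == self.lane && self.in_current_slice(idx)
    }

    /// Indices of the segment this view is allowed to write.
    pub fn segment(&self) -> std::ops::Range<usize> {
        let start = self.lane * self.lane_len + self.slice * self.seg_len;
        start..start + self.seg_len
    }

    /// Shared read: own segment, or any word outside the slice being filled.
    pub fn get(&self, idx: usize) -> &u64 {
        assert!(idx < self.len, "index out of bounds");
        assert!(self.owns(idx) || !self.in_current_slice(idx), "may be written by another lane");
        // SAFETY: in bounds, and no other view can hold `&mut` to this word.
        unsafe { &*self.base.add(idx) }
    }

    /// Exclusive write: only inside this view's own segment.
    pub fn get_mut(&mut self, idx: usize) -> &mut u64 {
        assert!(idx < self.len && self.owns(idx), "not in this view's segment");
        // SAFETY: in bounds, segments are disjoint, and `&mut self` rules out
        // a live reference obtained from this same view.
        unsafe { &mut *self.base.add(idx) }
    }
}

/// Splits `mem` into one view per lane for the given slice (hypothetical helper).
pub fn segment_views(mem: &mut [u64], lanes: usize, slice: usize) -> Vec<SegmentView<'_>> {
    assert!(lanes > 0 && slice < SLICES && !mem.is_empty());
    assert!(mem.len() % (lanes * SLICES) == 0, "length must divide evenly");
    let (base, len) = (mem.as_mut_ptr(), mem.len());
    let lane_len = len / lanes;
    let seg_len = lane_len / SLICES;
    (0..lanes)
        .map(|lane| SegmentView { base, len, lane_len, seg_len, lane, slice, _borrow: PhantomData })
        .collect()
}

/// Fills one slice with a toy rule, one thread per lane (std threads, not rayon).
pub fn fill_slice(mem: &mut [u64], lanes: usize, slice: usize) {
    let views = segment_views(mem, lanes, slice);
    thread::scope(|s| {
        for mut view in views {
            s.spawn(move || {
                let other = (view.lane + 1) % lanes;
                for idx in view.segment() {
                    // Previous word of the same lane (0 at the start of the lane).
                    let prev = if idx % view.lane_len == 0 { 0 } else { *view.get(idx - 1) };
                    // First word of the previous slice in a neighbouring lane.
                    let cross = if slice == 0 {
                        0
                    } else {
                        *view.get(other * view.lane_len + (slice - 1) * view.seg_len)
                    };
                    *view.get_mut(idx) = prev + 1 + cross;
                }
            });
        }
    });
}
```

Calling `fill_slice` once per slice, in order, mirrors the synchronization points between slices. A read of a word in another lane's current segment trips the assertion in `get` instead of racing with the writer.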

The following tests have been run under Miri and pass (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now).

Finally, the alignment of blocks increases to 128 bytes to better prevent false sharing on modern platforms. The new value is based on the notes in crossbeam-utils::CachePadded.
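
As a rough illustration of what the alignment change means at the type level, the sketch below declares two hypothetical stand-ins for the block type (`Block64` and `Block128` are made-up names, not the crate's types) and a small helper to check where a value starts in memory. Since a block holds 1024 bytes of data, raising the alignment from 64 to 128 does not change its size, only the addresses at which it may be placed.

```rust
/// Hypothetical stand-in for a block aligned to a 64-byte cache line.
#[repr(align(64))]
pub struct Block64(pub [u64; 128]);

/// Hypothetical stand-in for a block aligned to 128 bytes, so that two
/// adjacent blocks never share a pair of prefetched 64-byte lines.
#[repr(align(128))]
pub struct Block128(pub [u64; 128]);

/// Returns true if `value` starts on a `line`-byte boundary.
pub fn starts_on_line<T>(value: &T, line: usize) -> bool {
    (value as *const T as usize) % line == 0
}
```

Because every element of a slice of `Block128` starts on a 128-byte boundary, `starts_on_line(&blocks[i], 128)` holds for each index `i`.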


I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions I just worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).

@newpavlov
Member

Could you benchmark the parallel implementation and compare it against the single threaded one?

@jonasmalacofilho

This comment was marked as outdated.

Comment on lines +345 to +354
    memory_blocks
        .segment_views(slice, lanes)
        .for_each(|mut memory_view| {
            let lane = memory_view.lane();
Author

@jonasmalacofilho jonasmalacofilho Jan 13, 2025

Please note that this fill_blocks diff is very noisy, due to a necessary indentation change + rustfmt + diff getting somewhat lost.

The only changes to this function are the use of the segment view iterator (here), accessing memory through the segment view API instead of indexing the memory_blocks slice (below), and making memory_blocks mutable (above).

@jonasmalacofilho
Author

master...HEAD with parallel feature
argon2i V0x10           time:   [21.324 ms 21.344 ms 21.371 ms]                          
                        change: [-0.3322% -0.1068% +0.0761%] (p = 0.34 > 0.05)
                        No change in performance detected.

argon2i V0x13           time:   [21.429 ms 21.447 ms 21.471 ms]                          
                        change: [+0.0329% +0.2197% +0.3896%] (p = 0.01 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.302 ms 21.322 ms 21.348 ms]                          
                        change: [+0.6139% +0.8010% +0.9679%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.367 ms 21.384 ms 21.408 ms]                          
                        change: [+1.8140% +1.9978% +2.1628%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x10          time:   [21.361 ms 21.379 ms 21.405 ms]                           
                        change: [+1.2980% +1.4700% +1.6321%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13          time:   [21.303 ms 21.320 ms 21.342 ms]                           
                        change: [+0.9147% +1.1631% +1.3556%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=2048 t=4 p=4                                                                             
                        time:   [1.6939 ms 1.6979 ms 1.7026 ms]
                        change: [-58.795% -58.661% -58.490%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=16384 t=4 p=4                                                                            
                        time:   [11.230 ms 11.309 ms 11.391 ms]
                        change: [-67.907% -67.695% -67.447%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=65536 t=4 p=4                                                                            
                        time:   [44.778 ms 45.122 ms 45.489 ms]
                        change: [-71.067% -70.867% -70.621%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=262144 t=4 p=4                                                                            
                        time:   [172.61 ms 173.58 ms 174.61 ms]
                        change: [-72.478% -72.337% -72.127%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=2 p=4                                                                            
                        time:   [11.964 ms 12.047 ms 12.132 ms]
                        change: [-69.521% -69.311% -69.093%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=8 p=4                                                                            
                        time:   [45.011 ms 45.311 ms 45.623 ms]
                        change: [-69.838% -69.634% -69.434%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=16 p=4                                                                            
                        time:   [88.879 ms 89.461 ms 90.061 ms]
                        change: [-69.861% -69.687% -69.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=24 p=4                                                                            
                        time:   [133.26 ms 134.09 ms 134.93 ms]
                        change: [-69.816% -69.628% -69.446%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=1                                                                            
                        time:   [8.1242 ms 8.1254 ms 8.1268 ms]
                        change: [+1.4099% +1.4320% +1.4529%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13 m=2048 t=8 p=2                                                                             
                        time:   [4.8775 ms 4.9057 ms 4.9336 ms]
                        change: [-39.640% -39.331% -38.984%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=4                                                                             
                        time:   [3.2967 ms 3.3045 ms 3.3137 ms]
                        change: [-59.213% -59.105% -58.995%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=6                                                                             
                        time:   [2.5706 ms 2.5757 ms 2.5827 ms]
                        change: [-68.446% -68.385% -68.301%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=8                                                                             
                        time:   [2.1205 ms 2.1339 ms 2.1500 ms]
                        change: [-73.975% -73.809% -73.631%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=12                                                                             
                        time:   [1.8220 ms 1.8515 ms 1.8819 ms]
                        change: [-77.377% -76.954% -76.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=16                                                                             
                        time:   [2.2035 ms 2.2221 ms 2.2437 ms]
                        change: [-73.287% -73.088% -72.841%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=64                                                                             
                        time:   [2.2370 ms 2.2553 ms 2.2788 ms]
                        change: [-74.567% -74.380% -74.087%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=1                                                                            
                        time:   [74.181 ms 74.228 ms 74.292 ms]
                        change: [-0.8519% -0.7318% -0.6115%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=32768 t=4 p=2                                                                            
                        time:   [39.565 ms 39.759 ms 39.980 ms]
                        change: [-47.750% -47.455% -47.143%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4                                                                            
                        time:   [23.032 ms 23.199 ms 23.368 ms]
                        change: [-69.607% -69.389% -69.150%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=6                                                                            
                        time:   [18.127 ms 18.171 ms 18.214 ms]
                        change: [-75.369% -75.303% -75.239%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=8                                                                            
                        time:   [14.412 ms 14.439 ms 14.471 ms]
                        change: [-80.442% -80.403% -80.360%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=12                                                                            
                        time:   [11.878 ms 12.021 ms 12.200 ms]
                        change: [-83.827% -83.654% -83.390%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=16                                                                            
                        time:   [14.359 ms 14.388 ms 14.423 ms]
                        change: [-80.504% -80.462% -80.415%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=64                                                                            
                        time:   [12.239 ms 12.285 ms 12.343 ms]
                        change: [-83.542% -83.480% -83.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=1                                                                            
                        time:   [652.11 ms 652.26 ms 652.40 ms]
                        change: [-6.4332% -6.4049% -6.3769%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=2                                                                            
                        time:   [337.65 ms 338.01 ms 338.40 ms]
                        change: [-51.454% -51.401% -51.345%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4                                                                            
                        time:   [178.52 ms 179.41 ms 180.40 ms]
                        change: [-74.218% -74.087% -73.947%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=6                                                                            
                        time:   [137.57 ms 139.27 ms 141.00 ms]
                        change: [-80.074% -79.832% -79.558%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=8                                                                            
                        time:   [136.21 ms 136.41 ms 136.64 ms]
                        change: [-80.298% -80.265% -80.231%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=12                                                                            
                        time:   [119.20 ms 120.03 ms 121.02 ms]
                        change: [-82.675% -82.535% -82.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=16                                                                            
                        time:   [146.64 ms 147.06 ms 147.47 ms]
                        change: [-78.611% -78.557% -78.499%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=64                                                                            
                        time:   [131.18 ms 131.41 ms 131.64 ms]
                        change: [-80.804% -80.771% -80.735%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: 6-core CPU with SMT.

master...HEAD without parallel feature, default param tests only
argon2i V0x10           time:   [21.365 ms 21.390 ms 21.419 ms]                          
                        change: [-0.9417% -0.7019% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2i V0x13           time:   [21.523 ms 21.548 ms 21.574 ms]                          
                        change: [+0.0241% +0.2325% +0.4389%] (p = 0.03 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.201 ms 21.220 ms 21.243 ms]                          
                        change: [-0.6101% -0.4179% -0.2436%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.403 ms 21.426 ms 21.453 ms]                          
                        change: [+0.3981% +0.6366% +0.8608%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x10          time:   [21.241 ms 21.258 ms 21.279 ms]                           
                        change: [-1.7410% -1.5319% -1.3262%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13          time:   [21.319 ms 21.335 ms 21.355 ms]                           
                        change: [-0.9682% -0.7757% -0.5904%] (p = 0.00 < 0.05)
                        Change within noise threshold.

@tarcieri
Member

@jonasmalacofilho if you can rebase, I added cargo careful in #553, which should help spot issues in unsafe code

Coordinated shared access in the memory blocks is implemented with
`SegmentViewIter` and associated types, which provide views into the
Argon2 blocks memory that can be processed in parallel.

These views alias in the regions that are read-only, but are disjoint in
the regions where mutation happens. Effectively, they implement, with a
combination of mutable borrowing and runtime checking, the cooperative
contract outlined in RFC 9106.

To avoid aliasing mutable references into the entire buffer of blocks
(which would be UB), pointers are used up to the moment where a
reference (shared or mutable) into a specific block is returned. At that
point, aliasing is no longer possible, as argued in SAFETY comments
and/or checked at runtime.

The following tests have been executed in and pass Miri (modulo
unrelated warnings):

    reference_argon2i_v0x13_2_8_2
    reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so
only the most relevant tests were chosen).

Finally, add a default-enabled `parallel` feature, with an otherwise
optional dependency on `rayon`, and parallelize the filling of blocks
using the memory views mentioned above.

…ch64

Based on notes in crossbeam-utils::CachePadded:

https://github.com/crossbeam-rs/crossbeam/blob/17fb8417a83a/crossbeam-utils/src/cache_padded.rs#L63-L79

In summary:

- while x86-64 cache lines are still 64 bytes, modern prefetchers pull
  them in pairs, so for the purpose of preventing false sharing, we need
  128-byte alignment.

- on aarch64 big.LITTLE, the "big" cores use 128-byte cache lines.

Disable it by default, and adjust its dependencies as well as the
minimum rayon version to match balloon-hash and pbkdf2.

… and aarch64"

This reverts commit 1342037.

For reasons not yet understood, aligning blocks to 128-byte boundaries
results in worse performance than using 64-byte alignment.

Looking into this with perf suggests that it is *not* a cache problem,
but rather that the generated code is different and results in
substantially more instructions being executed when the blocks are
128-byte aligned.

For now, revert the alignment back to 64 bytes. While we're at it, also
remove the comment that suggests alignment is only needed to prevent
false sharing: it's possible that other places in the crate, which I
haven't checked, required (for correctness or best performance) the
64-byte alignment we're reverting back to.

It's worth noting that false sharing isn't generally a major issue in
Argon2: due to how memory is accessed, only the first and last few words
of a segment can (and most of the time probably still won't) experience
some false sharing with reads from other lanes.

Finally, changing the alignment with 1342037 would have been a major
SemVer change:

https://doc.rust-lang.org/cargo/reference/semver.html#repr-align-n-change

While investigating the scaling performance of the parallel
implementation, I noticed a substantial chunk of time taken on
block allocation in `hash_password_into`.

The issue lies in `vec![Block::default(); ...]`, which clones the supplied
block. This happens because the standard library lacks a suitable
specialization that can be used with `Block` (or, for that matter,
`[u64; 128]`).

Therefore, let's instead allocate a big bag of bytes and then transmute
it, or more precisely a mutable slice into it, to produce the slice of
blocks to pass into `hash_password_into_with_memory`.

One point to pay attention to is that `Block` currently specifies
64-byte alignment, while a byte slice has an alignment of 1.

Luckily, `slice::align_to_mut` is particularly well suited for this. It
is also cleaner and less error prone than other unsafe alternatives I
tried (a couple of them using `MaybeUninit`).
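
A minimal sketch of this allocation strategy follows. It is not the patch itself: `Block` here is a hypothetical stand-in with the same layout assumptions (128 `u64` words, 64-byte alignment), and `with_zeroed_blocks` is a made-up helper. The byte buffer is over-allocated by `align - 1` bytes so that an aligned run of `count` blocks always fits.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the crate's block type.
#[repr(align(64))]
pub struct Block(pub [u64; 128]);

/// Allocates zeroed bytes, reinterprets the aligned middle part as blocks,
/// and passes exactly `count` of them to `f` (hypothetical helper).
pub fn with_zeroed_blocks<R>(count: usize, f: impl FnOnce(&mut [Block]) -> R) -> R {
    // Extra `align - 1` bytes leave room to skip an unaligned prefix.
    // (Overflow of the size computation is not handled in this sketch.)
    let mut bytes = vec![0u8; count * size_of::<Block>() + align_of::<Block>() - 1];

    // SAFETY: `Block` wraps plain `u64` words, so every bit pattern,
    // including all zeros, is a valid value.
    let (_prefix, blocks, _suffix) = unsafe { bytes.align_to_mut::<Block>() };

    // `align_to_mut` documents the middle slice length as best effort, so
    // check it rather than rely on it.
    assert!(blocks.len() >= count, "aligned region is too short");
    f(&mut blocks[..count])
}
```

A caller would use it as `with_zeroed_blocks(m_cost, |blocks| /* fill and hash */)`. The zeroed `Vec<u8>` lets the allocator hand back already-zeroed memory instead of cloning one block at a time.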

This patch passes Miri on:

    reference_argon2i_v0x13_2_8_2
    reference_argon2id_v0x13_2_8_2

And the performance gains are considerable:

argon2id V0x13 m=2048 t=8 p=4
                        time:   [3.3493 ms 3.3585 ms 3.3686 ms]
                        change: [-6.1577% -5.7842% -5.4067%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4
                        time:   [24.106 ms 24.253 ms 24.401 ms]
                        change: [-9.8553% -8.9089% -7.9745%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4
                        time:   [181.68 ms 182.96 ms 184.35 ms]
                        change: [-28.165% -27.506% -26.896%] (p = 0.00 < 0.05)
                        Performance has improved.

(For the users that don't allocate the blocks themselves).
@jonasmalacofilho jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 264821d to 018c3e9 Compare January 21, 2025 18:32
@jonasmalacofilho
Author

@tarcieri oh, i forgot about that one. Rebased, and thanks for pointing it out!

That said, we should probably also try to add the very cheapest of tests and have it run in Miri in CI:

> That said, there is a lot of Undefined Behavior that is not detected by cargo careful; check out Miri if you want to be more exhaustively covered. The advantage of cargo careful over Miri is that it works on all code, supports using arbitrary system and C FFI functions, and is much faster.

@jonasmalacofilho
Author

By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. And so I've kept edits to a minimum for now, so that you can actually review it.

@tarcieri
Member

Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

@jonasmalacofilho
Author

> Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

I think the 2_8_2 (t=2,m=2,p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1,m=8,p=2 tests and see if they execute in acceptable time in CI.

Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.
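
For reference, gating could look roughly like the sketch below. `toy_work` is a hypothetical stand-in for an expensive hash call and the test names are made up; only the `#[cfg(not(miri))]` attribute is the point.

```rust
/// Toy stand-in for an expensive computation (hypothetical).
pub fn toy_work(n: u64) -> u64 {
    (1..=n).sum()
}

// Cheap enough to run everywhere, including under Miri.
#[test]
fn cheap_params_run_everywhere() {
    assert_eq!(toy_work(4), 10);
}

// Compiled out when the test suite is built by Miri.
#[test]
#[cfg(not(miri))]
fn expensive_params_skip_miri() {
    assert_eq!(toy_work(100_000), 5_000_050_000);
}
```

An alternative is `#[cfg_attr(miri, ignore)]`, which keeps the test compiled under Miri but reports it as ignored.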
