argon2: add parallelism #547

jonasmalacofilho · 2025-01-13T03:19:55Z

Adds a ~~default-enabled~~ parallel feature, with an ~~otherwise~~ optional dependency on rayon, and parallelizes the filling of blocks using the memory views described below.
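As a rough illustration of how such a feature switch can look in code, here is a minimal sketch. The helper name `for_each_lane` is hypothetical, and the PR itself exposes an iterator type rather than a helper function; the only external calls used are `rayon::prelude::*`, `into_par_iter` and `for_each`. The feature name `parallel` is taken from the description above.

```rust
/// Applies `work` to every item. With the `parallel` feature enabled the
/// items are distributed over rayon's thread pool; otherwise they are
/// processed one after another on the calling thread.
/// (`for_each_lane` is a hypothetical helper, not part of the PR.)
#[cfg(feature = "parallel")]
pub fn for_each_lane<T, F>(items: Vec<T>, work: F)
where
    T: Send,
    F: Fn(T) + Sync + Send,
{
    use rayon::prelude::*;
    items.into_par_iter().for_each(work);
}

/// Sequential fallback with the same call shape, so callers do not need
/// their own `cfg` switches.
#[cfg(not(feature = "parallel"))]
pub fn for_each_lane<T, F>(items: Vec<T>, work: F)
where
    F: Fn(T),
{
    items.into_iter().for_each(work);
}
```

Keeping both variants behind one signature means the calling code compiles unchanged whether or not the optional dependency is present.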

Coordinated shared access in the memory blocks is implemented with a SegmentViewIter iterator, which implements either rayon::iter::ParallelIterator or core::iter::Iterator and returns SegmentView views into the Argon2 blocks memory that are safe to be used in parallel.

The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
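The access rules described above can be sketched as follows. This is a simplified stand-in, not the PR's actual `SegmentView`: the block type is shrunk to four words, the fill rule is a toy, `std::thread::scope` replaces rayon, and every name other than `SegmentView` is an assumption made for illustration.

```rust
use std::marker::PhantomData;

/// Argon2 splits every lane into four slices (synchronization points).
const SYNC_POINTS: usize = 4;

/// Tiny stand-in for the real 1 KiB Argon2 block, to keep the sketch short.
pub type Block = [u64; 4];

/// Hypothetical view handed to the worker that fills segment `(lane, slice)`.
pub struct SegmentView<'a> {
    base: *mut Block, // raw pointer: no reference ever spans the whole buffer
    lanes: usize,
    lane_len: usize,
    lane: usize,
    slice: usize,
    _memory: PhantomData<&'a mut [Block]>,
}

// SAFETY: the accessors below never hand out a reference to a block that
// another view is allowed to mutate, so views may move to other threads.
unsafe impl Send for SegmentView<'_> {}

impl SegmentView<'_> {
    pub fn segment_len(&self) -> usize {
        self.lane_len / SYNC_POINTS
    }

    /// Shared access: own lane, or any lane outside the slice being filled.
    pub fn get(&self, lane: usize, index: usize) -> &Block {
        assert!(lane < self.lanes && index < self.lane_len, "out of bounds");
        let in_current_slice = index / self.segment_len() == self.slice;
        assert!(lane == self.lane || !in_current_slice, "block may be mutated elsewhere");
        // SAFETY: in bounds, and no other view can produce `&mut` to this block.
        unsafe { &*self.base.add(lane * self.lane_len + index) }
    }

    /// Exclusive access: only blocks of this view's own segment.
    pub fn get_mut(&mut self, index: usize) -> &mut Block {
        assert!(index < self.lane_len, "out of bounds");
        assert!(index / self.segment_len() == self.slice, "not in own segment");
        // SAFETY: in bounds, and only this view may touch its own segment.
        unsafe { &mut *self.base.add(self.lane * self.lane_len + index) }
    }
}

/// Creates one view per lane for the given slice.
pub fn segment_views(memory: &mut [Block], lanes: usize, slice: usize) -> Vec<SegmentView<'_>> {
    assert!(lanes > 0 && slice < SYNC_POINTS);
    assert!(!memory.is_empty() && memory.len() % (lanes * SYNC_POINTS) == 0);
    let (base, lane_len) = (memory.as_mut_ptr(), memory.len() / lanes);
    (0..lanes)
        .map(|lane| SegmentView { base, lanes, lane_len, lane, slice, _memory: PhantomData })
        .collect()
}

/// Fills one slice with one thread per lane (std threads stand in for rayon).
pub fn fill_slice(memory: &mut [Block], lanes: usize, slice: usize) {
    assert!(slice > 0, "this toy rule reads from the previous slice");
    let views = segment_views(memory, lanes, slice);
    std::thread::scope(|scope| {
        for mut view in views {
            scope.spawn(move || {
                let (seg, other) = (view.segment_len(), (view.lane + 1) % lanes);
                for index in slice * seg..(slice + 1) * seg {
                    // Toy "compression": previous-slice block of the next lane, plus one.
                    let reference = *view.get(other, index - seg);
                    *view.get_mut(index) = reference.map(|word| word + 1);
                }
            });
        }
    });
}
```

Within a single view the borrow checker keeps `get` and `get_mut` results from overlapping, since one borrows `&self` and the other `&mut self`. Across views, the runtime assertions keep every readable block outside every other view's writable segment.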

The following tests have been run under Miri and pass (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now).

~~Finally, the alignment of Blocks increases to 128 bytes for better prevention of false sharing on modern platforms. The new value is based on notes on crossbeam-utils::CachePadded.~~

I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions I just worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).

newpavlov · 2025-01-13T11:34:04Z

Could you benchmark the parallel implementation and compare it against the single threaded one?

jonasmalacofilho · 2025-01-13T12:57:07Z

argon2/src/lib.rs

+                memory_blocks
+                    .segment_views(slice, lanes)
+                    .for_each(|mut memory_view| {
+                        let lane = memory_view.lane();


Please note that this fill_blocks diff is very noisy, due to a necessary indentation change + rustfmt + diff getting somewhat lost.

The only changes to this function are the use of the segment view iterator (here), the accessing of memory through the segment view API instead of through indexing of the memory_blocks slice (below), and changing memory_blocks to be mutable (above).

jonasmalacofilho · 2025-01-16T11:16:18Z

master...HEAD with parallel feature

argon2i V0x10           time:   [21.324 ms 21.344 ms 21.371 ms]                          
                        change: [-0.3322% -0.1068% +0.0761%] (p = 0.34 > 0.05)
                        No change in performance detected.

argon2i V0x13           time:   [21.429 ms 21.447 ms 21.471 ms]                          
                        change: [+0.0329% +0.2197% +0.3896%] (p = 0.01 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.302 ms 21.322 ms 21.348 ms]                          
                        change: [+0.6139% +0.8010% +0.9679%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.367 ms 21.384 ms 21.408 ms]                          
                        change: [+1.8140% +1.9978% +2.1628%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x10          time:   [21.361 ms 21.379 ms 21.405 ms]                           
                        change: [+1.2980% +1.4700% +1.6321%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13          time:   [21.303 ms 21.320 ms 21.342 ms]                           
                        change: [+0.9147% +1.1631% +1.3556%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=2048 t=4 p=4                                                                             
                        time:   [1.6939 ms 1.6979 ms 1.7026 ms]
                        change: [-58.795% -58.661% -58.490%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=16384 t=4 p=4                                                                            
                        time:   [11.230 ms 11.309 ms 11.391 ms]
                        change: [-67.907% -67.695% -67.447%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=65536 t=4 p=4                                                                            
                        time:   [44.778 ms 45.122 ms 45.489 ms]
                        change: [-71.067% -70.867% -70.621%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=262144 t=4 p=4                                                                            
                        time:   [172.61 ms 173.58 ms 174.61 ms]
                        change: [-72.478% -72.337% -72.127%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=2 p=4                                                                            
                        time:   [11.964 ms 12.047 ms 12.132 ms]
                        change: [-69.521% -69.311% -69.093%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=8 p=4                                                                            
                        time:   [45.011 ms 45.311 ms 45.623 ms]
                        change: [-69.838% -69.634% -69.434%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=16 p=4                                                                            
                        time:   [88.879 ms 89.461 ms 90.061 ms]
                        change: [-69.861% -69.687% -69.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=24 p=4                                                                            
                        time:   [133.26 ms 134.09 ms 134.93 ms]
                        change: [-69.816% -69.628% -69.446%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=1                                                                            
                        time:   [8.1242 ms 8.1254 ms 8.1268 ms]
                        change: [+1.4099% +1.4320% +1.4529%] (p = 0.00 < 0.05)
                        Performance has regressed.

argon2id V0x13 m=2048 t=8 p=2                                                                             
                        time:   [4.8775 ms 4.9057 ms 4.9336 ms]
                        change: [-39.640% -39.331% -38.984%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=4                                                                             
                        time:   [3.2967 ms 3.3045 ms 3.3137 ms]
                        change: [-59.213% -59.105% -58.995%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=6                                                                             
                        time:   [2.5706 ms 2.5757 ms 2.5827 ms]
                        change: [-68.446% -68.385% -68.301%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=8                                                                             
                        time:   [2.1205 ms 2.1339 ms 2.1500 ms]
                        change: [-73.975% -73.809% -73.631%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=12                                                                             
                        time:   [1.8220 ms 1.8515 ms 1.8819 ms]
                        change: [-77.377% -76.954% -76.482%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=16                                                                             
                        time:   [2.2035 ms 2.2221 ms 2.2437 ms]
                        change: [-73.287% -73.088% -72.841%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=2048 t=8 p=64                                                                             
                        time:   [2.2370 ms 2.2553 ms 2.2788 ms]
                        change: [-74.567% -74.380% -74.087%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=1                                                                            
                        time:   [74.181 ms 74.228 ms 74.292 ms]
                        change: [-0.8519% -0.7318% -0.6115%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x13 m=32768 t=4 p=2                                                                            
                        time:   [39.565 ms 39.759 ms 39.980 ms]
                        change: [-47.750% -47.455% -47.143%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4                                                                            
                        time:   [23.032 ms 23.199 ms 23.368 ms]
                        change: [-69.607% -69.389% -69.150%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=6                                                                            
                        time:   [18.127 ms 18.171 ms 18.214 ms]
                        change: [-75.369% -75.303% -75.239%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=8                                                                            
                        time:   [14.412 ms 14.439 ms 14.471 ms]
                        change: [-80.442% -80.403% -80.360%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=12                                                                            
                        time:   [11.878 ms 12.021 ms 12.200 ms]
                        change: [-83.827% -83.654% -83.390%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=16                                                                            
                        time:   [14.359 ms 14.388 ms 14.423 ms]
                        change: [-80.504% -80.462% -80.415%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=64                                                                            
                        time:   [12.239 ms 12.285 ms 12.343 ms]
                        change: [-83.542% -83.480% -83.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=1                                                                            
                        time:   [652.11 ms 652.26 ms 652.40 ms]
                        change: [-6.4332% -6.4049% -6.3769%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=2                                                                            
                        time:   [337.65 ms 338.01 ms 338.40 ms]
                        change: [-51.454% -51.401% -51.345%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4                                                                            
                        time:   [178.52 ms 179.41 ms 180.40 ms]
                        change: [-74.218% -74.087% -73.947%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=6                                                                            
                        time:   [137.57 ms 139.27 ms 141.00 ms]
                        change: [-80.074% -79.832% -79.558%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=8                                                                            
                        time:   [136.21 ms 136.41 ms 136.64 ms]
                        change: [-80.298% -80.265% -80.231%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=12                                                                            
                        time:   [119.20 ms 120.03 ms 121.02 ms]
                        change: [-82.675% -82.535% -82.391%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=16                                                                            
                        time:   [146.64 ms 147.06 ms 147.47 ms]
                        change: [-78.611% -78.557% -78.499%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=64                                                                            
                        time:   [131.18 ms 131.41 ms 131.64 ms]
                        change: [-80.804% -80.771% -80.735%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: 6-core CPU with SMT.
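For reading the numbers above, it can help to turn Criterion's relative change into a speedup factor over the baseline. A minimal sketch follows; the function name is made up for this note. For m=32768 t=4, the reported -47.455% at p=2 corresponds to roughly 1.9x, and the -83.654% at p=12 to roughly 6.1x.

```rust
/// Converts a Criterion relative time change (in percent, negative means
/// faster) into a speedup factor over the baseline run.
pub fn speedup_from_change(change_percent: f64) -> f64 {
    // new_time = old_time * (1 + change / 100), and speedup = old_time / new_time
    1.0 / (1.0 + change_percent / 100.0)
}
```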

master...HEAD without parallel feature, default param tests only

argon2i V0x10           time:   [21.365 ms 21.390 ms 21.419 ms]                          
                        change: [-0.9417% -0.7019% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2i V0x13           time:   [21.523 ms 21.548 ms 21.574 ms]                          
                        change: [+0.0241% +0.2325% +0.4389%] (p = 0.03 < 0.05)
                        Change within noise threshold.

argon2d V0x10           time:   [21.201 ms 21.220 ms 21.243 ms]                          
                        change: [-0.6101% -0.4179% -0.2436%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2d V0x13           time:   [21.403 ms 21.426 ms 21.453 ms]                          
                        change: [+0.3981% +0.6366% +0.8608%] (p = 0.00 < 0.05)
                        Change within noise threshold.

argon2id V0x10          time:   [21.241 ms 21.258 ms 21.279 ms]                           
                        change: [-1.7410% -1.5319% -1.3262%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13          time:   [21.319 ms 21.335 ms 21.355 ms]                           
                        change: [-0.9682% -0.7757% -0.5904%] (p = 0.00 < 0.05)
                        Change within noise threshold.

tarcieri · 2025-01-21T18:21:23Z

@jonasmalacofilho if you can rebase I added cargo careful in #553 which should help spot issues in unsafe code

Coordinated shared access in the memory blocks is implemented with `SegmentViewIter` and associated types, which provide views into the Argon2 blocks memory that can be processed in parallel.

These views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible, as argued in SAFETY comments and/or checked at runtime.

The following tests have been executed in and pass Miri (modulo unrelated warnings):

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so only the most relevant tests were chosen).

Finally, add a default-enabled `parallel` feature, with an otherwise optional dependency on `rayon`, and parallelize the filling of blocks using the memory views mentioned above.

…ch64

Based on notes in crossbeam-utils::CachePadded:

https://github.com/crossbeam-rs/crossbeam/blob/17fb8417a83a/crossbeam-utils/src/cache_padded.rs#L63-L79

In summary:

- while x86-64 cache lines are still 64 bytes, modern prefetchers pull them in pairs, so for the purpose of preventing false sharing, we need 128-byte alignment.
- on aarch64 big.LITTLE, the "big" cores use 128-byte cache lines.

Disable it by default, and adjust its dependencies as well as the minimum rayon version to match balloon-hash and pbkdf2.

… and aarch64"

This reverts commit 1342037.

For reasons not yet understood, aligning blocks to 128-byte boundaries results in worse performance than using 64-byte alignment. Looking into this with perf suggests that it is *not* a cache problem, but rather that the generated code is different and results in substantially more instructions being executed when the blocks are 128-byte aligned.

For now, revert the alignment back to 64 bytes. While we're at it, also remove the comment that suggests alignment is only needed to prevent false sharing: it's possible that other places in the crate, which I haven't checked, require (for correctness or best performance) the 64-byte alignment we're reverting back to.

It's worth noting that false sharing isn't generally a major issue in Argon2: due to how memory is accessed, only the first and last few words of a segment can (and most of the time probably still won't) experience some false sharing with reads from other lanes.

Finally, changing the alignment with 1342037 would have been a major SemVer change:

https://doc.rust-lang.org/cargo/reference/semver.html#repr-align-n-change
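To make the SemVer point concrete, here is a small sketch of how `repr(align)` shows up in a type's observable layout. `Block64` and `Block128` are hypothetical stand-ins for the crate's block type at the two alignments discussed, not the real definitions.

```rust
use std::mem::{align_of, size_of};

/// Stand-in for a block of 128 words with the 64-byte alignment kept by the revert.
#[repr(align(64))]
#[derive(Clone, Copy)]
pub struct Block64(pub [u64; 128]);

/// The same payload with the 128-byte alignment that was tried and reverted.
#[repr(align(128))]
#[derive(Clone, Copy)]
pub struct Block128(pub [u64; 128]);

/// Returns `(size, alignment)` in bytes for any type.
pub fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}
```

Both types occupy 1024 bytes, but downstream code can observe the differing alignment, for example when placing blocks in its own buffers, which is why such a change is visible to users of a public type.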

While investigating the scaling performance of the parallel implementation, I noticed a substantial chunk of time taken on block allocation in `hash_password_into`. The issue lies in `vec![Block::default(); ...]`, which clones the supplied block. This happens because the standard library lacks a suitable specialization that can be used with `Block` (or, for that matter, `[u64; 128]`).

Therefore, let's instead allocate a big bag of bytes and then transmute it, or more precisely a mutable slice into it, to produce the slice of blocks to pass into `hash_password_into_with_memory`.

One point to pay attention to is that `Block` currently specifies 64-byte alignment, while a byte slice has alignment of 1. Luckily, `slice::align_to_mut` is particularly well suited for this. It is also cleaner and less error prone than other unsafe alternatives I tried (a couple of them using `MaybeUninit`).

This patch passes Miri on:

reference_argon2i_v0x13_2_8_2
reference_argon2id_v0x13_2_8_2

And the performance gains are considerable:

argon2id V0x13 m=2048 t=8 p=4
                        time:   [3.3493 ms 3.3585 ms 3.3686 ms]
                        change: [-6.1577% -5.7842% -5.4067%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=32768 t=4 p=4
                        time:   [24.106 ms 24.253 ms 24.401 ms]
                        change: [-9.8553% -8.9089% -7.9745%] (p = 0.00 < 0.05)
                        Performance has improved.

argon2id V0x13 m=1048576 t=1 p=4
                        time:   [181.68 ms 182.96 ms 184.35 ms]
                        change: [-28.165% -27.506% -26.896%] (p = 0.00 < 0.05)
                        Performance has improved.

(For the users that don't allocate the blocks themselves).
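The allocation approach described in this commit message can be sketched roughly as below. This is an illustration under assumptions, not the patch itself: `BlockBuffer` and its methods are hypothetical names, and `Block` is a stand-in with the 64-byte alignment mentioned above.

```rust
use std::mem::{align_of, size_of};

/// Stand-in for the crate's block type: 128 words, 64-byte aligned.
#[repr(align(64))]
#[derive(Clone, Copy)]
pub struct Block(pub [u64; 128]);

/// Owns a zeroed byte buffer and exposes it as a slice of blocks.
pub struct BlockBuffer {
    bytes: Vec<u8>,
    len: usize,
}

impl BlockBuffer {
    /// Allocates room for `len` blocks without cloning a template block.
    pub fn zeroed(len: usize) -> Self {
        // Over-allocate by one alignment unit so that `len` aligned blocks
        // fit regardless of where the byte buffer happens to start.
        let bytes = vec![0u8; len * size_of::<Block>() + align_of::<Block>()];
        Self { bytes, len }
    }

    /// Reinterprets the aligned middle of the byte buffer as blocks.
    pub fn as_blocks_mut(&mut self) -> &mut [Block] {
        // SAFETY: every bit pattern is a valid `[u64; 128]`, and
        // `align_to_mut` only yields correctly aligned, in-bounds elements.
        let (_prefix, blocks, _suffix) = unsafe { self.bytes.align_to_mut::<Block>() };
        // `align_to_mut` does not guarantee a maximal middle slice, so this
        // indexing panics (rather than misbehaves) if fewer blocks came back.
        &mut blocks[..self.len]
    }
}
```

A caller would create the buffer once, for example `BlockBuffer::zeroed(block_count)`, and hand `as_blocks_mut()` to the hashing routine that accepts caller-provided memory.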

jonasmalacofilho · 2025-01-21T18:37:02Z

@tarcieri oh, i forgot about that one. Rebased, and thanks for pointing it out!

That said, we should probably also try to add the very cheapest of tests and have it run in Miri in CI:

> That said, there is a lot of Undefined Behavior that is not detected by cargo careful; check out Miri if you want to be more exhaustively covered. The advantage of cargo careful over Miri is that it works on all code, supports using arbitrary system and C FFI functions, and is much faster.

jonasmalacofilho · 2025-01-21T18:46:18Z

By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. And so I've kept edits to a minimum for now, so that you can actually review it.

tarcieri · 2025-01-21T18:47:15Z

Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

jonasmalacofilho · 2025-01-21T18:59:15Z

> Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under #[cfg(not(miri))] perhaps?

I think the 2_8_2 (t=2,m=2,p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1,m=8,p=2 tests and see if they execute in acceptable time in CI.

Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.
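A minimal sketch of the gating idea discussed above: expensive tests are compiled out under Miri, while a cheap one picks a smaller workload there. The function names and the concrete numbers are placeholders, not parameters or tests from this PR.

```rust
/// Stand-in for an expensive computation, such as hashing with large parameters.
pub fn expensive_work(rounds: u32) -> u64 {
    (0..rounds).map(u64::from).sum()
}

/// Picks a workload size: tiny under Miri, larger in normal test runs.
/// The concrete numbers are placeholders.
pub fn rounds_for_environment() -> u32 {
    if cfg!(miri) {
        16
    } else {
        100_000
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Compiled out entirely when the test suite runs under Miri.
    #[cfg(not(miri))]
    #[test]
    fn expensive_case() {
        assert_eq!(expensive_work(100_000), 4_999_950_000);
    }

    // Cheap enough to run everywhere, including under `cargo miri test`.
    #[test]
    fn cheap_case() {
        let rounds = u64::from(rounds_for_environment());
        assert_eq!(expensive_work(rounds as u32), rounds * (rounds - 1) / 2);
    }
}
```

An alternative with the same effect is `#[cfg_attr(miri, ignore)]`, which keeps the test compiled but skips it when Miri runs the suite.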

jonasmalacofilho added 6 commits January 21, 2025 15:29

argon2: make parallel feature consistent with other crates in the repo

ea63ecd

argon2: simplify ParallelIterator impl for SegmentViewIter

5eefce1

jonasmalacofilho force-pushed the add-parallelism-to-argon2 branch from 264821d to 018c3e9 on January 21, 2025 18:32
