argon2: add parallelism #547
Conversation
Could you benchmark the parallel implementation and compare it against the single-threaded one?
```rust
memory_blocks
    .segment_views(slice, lanes)
    .for_each(|mut memory_view| {
        let lane = memory_view.lane();
```
Please note that this `fill_blocks` diff is very noisy, due to a necessary indentation change + rustfmt + diff getting somewhat lost. The only changes to this function are the use of the segment view iterator (here), the accessing of memory through the segment view API instead of through indexing of the `memory_blocks` slice (below), and changing `memory_blocks` to be mutable (above).
Benchmark results (full output not reproduced here):

- master...HEAD with the `parallel` feature (note: 6-core CPU with SMT)
- master...HEAD without the `parallel` feature, default param tests only
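For reference, a comparison like the one summarized above can be obtained by running a criterion benchmark against builds with and without the `parallel` feature. The sketch below is illustrative only; the parameter values, names, and inputs are assumptions and not taken from the crate's actual benches:

```rust
use argon2::{Algorithm, Argon2, Params, Version};
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_argon2id(c: &mut Criterion) {
    // Hypothetical parameters; adjust m/t/p to match the cases of interest.
    let params = Params::new(32 * 1024, 4, 4, Some(32)).expect("valid params");
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params);
    let mut out = [0u8; 32];

    c.bench_function("argon2id m=32768 t=4 p=4", |b| {
        b.iter(|| {
            argon2
                .hash_password_into(b"password", b"somesaltsomesalt", &mut out)
                .expect("hashing should succeed")
        })
    });
}

criterion_group!(benches, bench_argon2id);
criterion_main!(benches);
```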
@jonasmalacofilho if you can rebase I added …
Coordinated shared access in the memory blocks is implemented with `SegmentViewIter` and associated types, which provide views into the Argon2 blocks memory that can be processed in parallel. These views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106.

To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible, as argued in SAFETY comments and/or checked at runtime.

The following tests have been executed in and pass Miri (modulo unrelated warnings):

- reference_argon2i_v0x13_2_8_2
- reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so only the most relevant tests were chosen.)

Finally, add a default-enabled `parallel` feature, with an otherwise optional dependency on `rayon`, and parallelize the filling of blocks using the memory views mentioned above.
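As an illustration of the feature gating described above, here is a sketch with a placeholder work item; `process_segment` and the bare lane index are assumptions for brevity, not the PR's actual `fill_blocks` code:

```rust
#[cfg(feature = "parallel")]
use rayon::prelude::*;

// Placeholder for filling one segment of one lane; in the PR the work item is
// a segment view, not a bare lane index.
fn process_segment(lane: usize) {
    let _ = lane;
}

fn fill_segments(lanes: usize) {
    // With the `parallel` feature, distribute segments across rayon's thread
    // pool; otherwise fall back to a plain sequential iterator.
    #[cfg(feature = "parallel")]
    (0..lanes).into_par_iter().for_each(process_segment);

    #[cfg(not(feature = "parallel"))]
    (0..lanes).for_each(process_segment);
}
```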
…ch64

Based on notes in crossbeam-utils::CachePadded:
https://github.com/crossbeam-rs/crossbeam/blob/17fb8417a83a/crossbeam-utils/src/cache_padded.rs#L63-L79

In summary:

- while x86-64 cache lines are still 64 bytes, modern prefetchers pull them in pairs, so for the purpose of preventing false sharing, we need 128-byte alignment;
- on aarch64 big.LITTLE, the "big" cores use 128-byte cache lines.
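In Rust, this change boils down to raising the type's `repr(align)` attribute. The stand-in type below only illustrates the idea and is not the crate's actual `Block` definition:

```rust
// Stand-in for the crate's Block type: 1 KiB of data, aligned to 128 bytes so
// that a block never shares a prefetcher-paired cache-line pair with another.
#[repr(align(128))]
#[derive(Clone, Copy)]
struct AlignedBlock([u64; 128]);

fn main() {
    assert_eq!(core::mem::size_of::<AlignedBlock>(), 1024);
    assert_eq!(core::mem::align_of::<AlignedBlock>(), 128);
}
```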
Disable it by default, and adjust its dependencies as well as the minimum rayon version to match balloon-hash and pbkdf2.
… and aarch64" This reverts commit 1342037. For reasons not yet understood, aligning blocks to 128-byte boundaries results in worse performance than using 64-byte alignment. Looking into this with perf suggests that it is *not* a cache problem, but rather that the generated code is different and results in substantially more instructions being executed when the blocks are 128-byte aligned. For now, revert the alignment back to 64 bytes. While we're at it, also remove the comment that suggests alignment is only needed to prevent false sharing: it's possible that other places in the crate, which I haven't checked, required (for correctness or best performance) the 64-byte alignment we're reverting back to. It's worth noting that false sharing isn't generally a major issue in Argon2: due to how memory is accessed, only the first and last few words of a segment can (and most of the time probably still won't) experience some false sharing with reads from other lanes. Finally, changing the alignment with 1342037 would have a major SemVer change: https://doc.rust-lang.org/cargo/reference/semver.html#repr-align-n-change
While investigating the scaling performance of the parallel implementation, I noticed a substantial chunk of time taken on block allocation in `hash_password_into`.

The issue lies in `vec![Block::default(); ...]`, which clones the supplied block. This happens because the standard library lacks a suitable specialization that can be used with `Block` (or, for that matter, `[u64; 128]`).

Therefore, let's instead allocate a big bag of bytes and then transmute it (or, more precisely, a mutable slice into it) to produce the slice of blocks to pass into `hash_password_into_with_memory`.

One point to pay attention to is that `Block` currently specifies 64-byte alignment, while a byte slice has an alignment of 1. Luckily, `slice::align_to_mut` is particularly well suited for this. It is also cleaner and less error prone than other unsafe alternatives I tried (a couple of them using `MaybeUninit`).

This patch passes Miri on:

- reference_argon2i_v0x13_2_8_2
- reference_argon2id_v0x13_2_8_2

And the performance gains are considerable:

    argon2id V0x13 m=2048 t=8 p=4
        time:   [3.3493 ms 3.3585 ms 3.3686 ms]
        change: [-6.1577% -5.7842% -5.4067%] (p = 0.00 < 0.05)
        Performance has improved.

    argon2id V0x13 m=32768 t=4 p=4
        time:   [24.106 ms 24.253 ms 24.401 ms]
        change: [-9.8553% -8.9089% -7.9745%] (p = 0.00 < 0.05)
        Performance has improved.

    argon2id V0x13 m=1048576 t=1 p=4
        time:   [181.68 ms 182.96 ms 184.35 ms]
        change: [-28.165% -27.506% -26.896%] (p = 0.00 < 0.05)
        Performance has improved.

(For the users that don't allocate the blocks themselves.)
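A self-contained sketch of that allocation strategy, assuming a stand-in `Block` type with the 64-byte alignment mentioned above; the real change lives in `hash_password_into` and may differ in its details:

```rust
// Stand-in for the crate's Block type (1 KiB, 64-byte aligned).
#[repr(align(64))]
#[derive(Clone, Copy)]
struct Block([u64; 128]);

fn main() {
    let block_count = 2048;

    // Allocate zeroed bytes instead of `vec![Block::default(); block_count]`,
    // which would clone the supplied block once per element. Over-allocate by
    // one block so that `block_count` aligned blocks fit regardless of where
    // the Vec's buffer happens to start.
    let mut bytes = vec![0u8; (block_count + 1) * core::mem::size_of::<Block>()];

    // SAFETY: every bit pattern (here: all zeroes) is a valid `Block`.
    let (_prefix, blocks, _suffix) = unsafe { bytes.align_to_mut::<Block>() };
    assert!(blocks.len() >= block_count);
    let blocks = &mut blocks[..block_count];

    // `blocks: &mut [Block]` can now be handed to a *_with_memory function.
    assert_eq!(blocks.len(), block_count);
}
```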
Force-pushed from 264821d to 018c3e9.
@tarcieri oh, I forgot about that one. Rebased, and thanks for pointing it out! That said, we should probably also try to add the very cheapest of tests and have them run in Miri in CI.
By the way, I think there are some things I can improve in the code, but I would really appreciate a review first. So I've kept edits to a minimum for now, so that you can actually review it.
Yeah, Miri is tricky specifically because you can't do anything computationally expensive under it. I think we could potentially gate expensive tests under …
I think the 2_8_2 (t=2, m=8, p=2) tests are the cheapest in the crate, and still quite expensive... I could try adding t=1, m=8, p=2 tests and see if they execute in acceptable time in CI. Additionally, maybe a few unit tests ensuring that allowed borrows pass in Miri, and that some known invalid borrow patterns are either impossible at compile time or caught at runtime.
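One standard way to do the gating mentioned above is the `#[cfg_attr(miri, ignore)]` attribute, which skips a test under `cargo miri test` while keeping it in normal runs. The test below is purely illustrative; its name, parameters, and inputs are not from the crate's test suite:

```rust
use argon2::{Algorithm, Argon2, Params, Version};

#[test]
#[cfg_attr(miri, ignore)] // too slow under Miri; still runs under `cargo test`
fn expensive_hash_test() {
    // Illustrative parameters (m=32, t=3, p=4) and output length.
    let params = Params::new(32, 3, 4, Some(32)).expect("valid params");
    let argon2 = Argon2::new(Algorithm::Argon2id, Version::V0x13, params);
    let mut out = [0u8; 32];
    argon2
        .hash_password_into(b"password", b"somesaltsomesalt", &mut out)
        .expect("hashing should succeed");
}
```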
Adds a default-enabled `parallel` feature, with an otherwise optional dependency on `rayon`, and parallelizes the filling of blocks using the memory views described below.

Coordinated shared access in the memory blocks is implemented with a `SegmentViewIter` iterator, which implements either `rayon::iter::ParallelIterator` or `core::iter::Iterator` and returns `SegmentView` views into the Argon2 blocks memory that are safe to use in parallel. The views alias in the regions that are read-only, but are disjoint in the regions where mutation happens. Effectively, they implement, with a combination of mutable borrowing and runtime checking, the cooperative contract outlined in RFC 9106. This is similar to what was suggested in #380.
To avoid aliasing mutable references into the entire buffer of blocks (which would be UB), pointers are used up to the moment where a reference (shared or mutable) into a specific block is returned. At that point, aliasing is no longer possible.
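A minimal self-contained sketch of that pointer-based approach; the type name, fields, and the simplification of the readable region to a single range are assumptions for illustration, not the PR's actual `SegmentView` API:

```rust
use core::ops::Range;

type Block = [u64; 128];

// Hands out references to individual blocks only, never a reference to the
// whole buffer, so no aliasing `&mut [Block]` over the full memory exists.
struct SegmentViewSketch {
    base: *mut Block,
    readable: Range<usize>, // blocks already filled: shared reads allowed
    writable: Range<usize>, // this view's own segment: exclusive writes
}

// SAFETY (sketch): the `writable` ranges of concurrently used views are
// disjoint, and `readable` only covers blocks no view is currently writing.
unsafe impl Send for SegmentViewSketch {}

impl SegmentViewSketch {
    /// Shared read of a block that is guaranteed not to be written right now.
    fn get(&self, index: usize) -> &Block {
        // Runtime check standing in for the RFC 9106 access rules
        // (previously completed slices plus this lane's own segment).
        assert!(self.readable.contains(&index) || self.writable.contains(&index));
        unsafe { &*self.base.add(index) }
    }

    /// Exclusive access to a block inside this view's own segment.
    fn get_mut(&mut self, index: usize) -> &mut Block {
        assert!(self.writable.contains(&index));
        unsafe { &mut *self.base.add(index) }
    }
}
```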
The following tests have been run in and pass Miri (modulo unrelated warnings):

- reference_argon2i_v0x13_2_8_2
- reference_argon2id_v0x13_2_8_2

(Running these in Miri is quite slow, taking ~5 minutes each, so I only ran the most obviously relevant tests for now.)
Finally, the alignment of `Block`s increases to 128 bytes to better prevent false sharing on modern platforms. The new value is based on notes in `crossbeam-utils::CachePadded`.

I also took some inspiration from an intermediate snapshot of #247, before the parallel implementation was removed, as well as from an implementation without any safe abstractions that I recently worked on for the rust-argon2 crate (sru-systems/rust-argon2#56).