Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Would this repo be open to a SIMD implementation of ChaCha20? #667

Closed
oconnor663 opened this issue Dec 16, 2018 · 13 comments · Fixed by #789
Closed

Would this repo be open to a SIMD implementation of ChaCha20? #667

oconnor663 opened this issue Dec 16, 2018 · 13 comments · Fixed by #789

Comments

@oconnor663
Copy link
Contributor

There's probably a factor of 5-6x to be had here. I could potentially port in some experimental code I have in https://github.com/oconnor663/chacha20_simd. What do people think?

@burdges
Copy link
Contributor

burdges commented Dec 16, 2018

Just curious: Are the optimizations in https://github.com/cesarb/chacha20-poly1305-aead/ any different?

@oconnor663
Copy link
Contributor Author

oconnor663 commented Dec 16, 2018

I think there's a substantial difference. I'm not familiar with that code, but by the looks of it it's operating on u32x4 vectors, which would correspond to 128-bit SSE instructions. The 256-bit AVX instruction set would provide about double the throughput, give or take.

@TheIronBorn
Copy link
Collaborator

Looks like chacha20-poly1305-aead expects at least portions of the code to compile to AVX2 instructions: https://github.com/cesarb/chacha20-poly1305-aead/blob/c10fa44e38c54006df7e1fc376158fdfd5befe03/src/simd_opt/u32x4.rs#L38

@TheIronBorn
Copy link
Collaborator

They also apply pshufb optimizations which likely speed up the rotates quite a bit.

@sneves
Copy link
Contributor

sneves commented Dec 17, 2018

When I added the ChaCha generator a lifetime ago, having compatible SIMD implementations was precisely the goal. At the time, however, there were no intrinsics or any good way to write it in Rust, so it seems to have stayed with the reference "one-star performance" implementation ever since.

In my experience writing C++ <random>-compatible engines, vectorized ChaCha8 with a modest ~256- (SSE2+) or ~512-byte (AVX2) state is as fast as PCG, and faster in some applications where you're not just generating random numbers for benchmarking. If the code in question uses integer multiplications, PCG cannibalizes the single multiplication port to a degree that makes ChaCha8, XorShift, etc significantly faster in those use cases.

You can also add forward security (which for some reason is listed as a feature on the documentation) to ChaCha8 very easily, at virtually no cost in performance, but this loses compatibility and jump-ahead capability.

@dhardy
Copy link
Member

dhardy commented Dec 17, 2018

Yes, @oconnor663, very much. Replacing Hc128 in StdRng with a fast, forward secure variant of ChaCha makes a lot of sense (ChaCha12 possibly; I'm not sure which variant is the most appropriate choice but ChaCha8 is very close to the minimum complexity not proven insecure).

PCG is another thing altogether; it really depends on whether you want a secure batched RNG or a minimal-memory-usage RNG. As such we shouldn't talk about replacing PCG with ChaCha in SmallRng.

@vks
Copy link
Collaborator

vks commented Dec 17, 2018

As such we shouldn't talk about replacing PCG with ChaCha in SmallRng.

We could consider removing SmallRng. ChaCha is only 8.5 times as large. Using smaller RNGs might be a very niche case.

@dhardy
Copy link
Member

dhardy commented Dec 18, 2018

I guess we could, but that's off-topic here and a quick search shows that SmallRng is already being used.

@tarcieri
Copy link

tarcieri commented Jan 4, 2019

FYI, I've been meaning to push a core::simd impl of ChaCha20 as part of https://github.com/RustCrypto/stream-ciphers but have not yet had the time to complete it

@vks
Copy link
Collaborator

vks commented Feb 4, 2019

There is another SIMD implementation: https://github.com/cryptocorrosion/cryptocorrosion/tree/master/stream-ciphers/chacha

@vks vks mentioned this issue Apr 8, 2019
@dhardy
Copy link
Member

dhardy commented Apr 19, 2019

@kazcw would you be open to adding rand_core support into your ChaCha implementation? Use of the existing block-rng code should in theory be enough to get us good performance here, though of course it needs testing (also #779 will result in a few changes).

Alternatively we could update the rand_chacha crate with the subset of the code relevant to us.

@kazcw
Copy link
Contributor

kazcw commented Apr 19, 2019

@dhardy Sure, I'll try to get to it this weekend

@tarcieri
Copy link

There's an open PR to add ChaCha20 support to RustCrypto, which could be published as the chacha20 crate.

Personally I think it'd be swell if people could combine efforts on a single, high-quality, well-maintained ChaCha20 implementation all Rust users could leverage.

kazcw added a commit to kazcw/rand that referenced this issue Apr 30, 2019
Performance:
gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)]

Minor version bump:
the only breaking change is that no-std builds now require
default-features=false (std is required by default for runtime cpu
detection; no_std builds will use the best implementation supported by
the target-features/target-cpu enabled at compile time)

New functionality:
ChaChaXRng is parameterized by round count at compile time. Convenient
aliases for the typical 20/12/8 round implementations exposed. ChaChaRng
is aliased to ChaCha20Rng for backward compatibility.

Closes rust-random#667
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants