Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use c2-chacha for stable, runtime-dispatched SIMD #789

Merged
merged 13 commits into from
May 7, 2019

Conversation

kazcw
Copy link
Contributor

@kazcw kazcw commented Apr 30, 2019

Performance:
gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)]

Minor version bump:
the only breaking change is that no-std builds now require
default-features=false (std is required by default for runtime cpu
detection; no_std builds will use the best implementation supported by
the target-features/target-cpu enabled at compile time)

New functionality:
ChaChaXRng is parameterized by round count at compile time. Convenient
aliases for the typical 20/12/8 round implementations exposed. ChaChaRng
is aliased to ChaCha20Rng for backward compatibility.

Closes #667

Performance:
gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)]

Minor version bump:
the only breaking change is that no-std builds now require
default-features=false (std is required by default for runtime cpu
detection; no_std builds will use the best implementation supported by
the target-features/target-cpu enabled at compile time)

New functionality:
ChaChaXRng is parameterized by round count at compile time. Convenient
aliases for the typical 20/12/8 round implementations exposed. ChaChaRng
is aliased to ChaCha20Rng for backward compatibility.

Closes rust-random#667
More familiar numbers; evenness requirement enforced by PartialDiv.
@dhardy
Copy link
Member

dhardy commented Apr 30, 2019

Xeon 1231 aka Haswell:

# from
test gen_bytes_chacha20             ... bench:   2,140,914 ns/iter (+/- 85,416) = 478 MB/s
test gen_u32_chacha20               ... bench:       7,330 ns/iter (+/- 128) = 545 MB/s
test gen_u64_chacha20               ... bench:      14,959 ns/iter (+/- 1,032) = 534 MB/s
test init_chacha                    ... bench:          23 ns/iter (+/- 1)
# to
test gen_bytes_chacha8              ... bench:     442,658 ns/iter (+/- 14,419) = 2313 MB/s
test gen_bytes_chacha12             ... bench:     592,790 ns/iter (+/- 13,809) = 1727 MB/s
test gen_bytes_chacha20             ... bench:     891,697 ns/iter (+/- 43,009) = 1148 MB/s
test gen_u32_chacha8                ... bench:       2,153 ns/iter (+/- 67) = 1857 MB/s
test gen_u32_chacha12               ... bench:       2,740 ns/iter (+/- 189) = 1459 MB/s
test gen_u32_chacha20               ... bench:       3,917 ns/iter (+/- 196) = 1021 MB/s
test gen_u64_chacha8                ... bench:       4,676 ns/iter (+/- 165) = 1710 MB/s
test gen_u64_chacha12               ... bench:       5,893 ns/iter (+/- 288) = 1357 MB/s
test gen_u64_chacha20               ... bench:       8,225 ns/iter (+/- 240) = 972 MB/s
test init_chacha                    ... bench:          45 ns/iter (+/- 2)
# compare
test gen_bytes_hc128                ... bench:     455,679 ns/iter (+/- 28,226) = 2247 MB/s
test gen_u32_hc128                  ... bench:       1,916 ns/iter (+/- 49) = 2087 MB/s
test gen_u64_hc128                  ... bench:       3,788 ns/iter (+/- 391) = 2111 MB/s
test init_hc128                     ... bench:       4,659 ns/iter (+/- 177)

Certainly a nice improvement!

Any thoughts @kazcw on which should be our default for StdRng? From what I remember ChaCha8 is potentially marginal but without known exploits, while others have significant margins.

BTW I pushed additional benchmark code to my fork.

@dhardy
Copy link
Member

dhardy commented May 2, 2019

@burdges would you be able to review this?

@vks
Copy link
Collaborator

vks commented May 2, 2019

Those are some very nice performance improvements! HC128 being faster than ChaCha20 and ChaCha12 while being slower than ChaCha8 is consistent with the eBACS benchmarks. However, it seems like the speed difference between ChaCha20 and HC128 is less than in our case, so I wonder whether there is still some potential for optimization.

rand_chacha/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

One concern is the number of crates this depends on (c2-chacha, byteorder, generic-array, typenum, ppv-lite86). We've had a few complaints that Rand pulls in too many dependencies; if we now use this for StdRng then we won't be reducing the number of dependencies as previously promised. This is not a blocking issue but we should consider whether there are good alternatives.

rand_chacha/src/chacha.rs Outdated Show resolved Hide resolved
@dhardy
Copy link
Member

dhardy commented May 3, 2019

HC128 being faster than ChaCha20

Not really; ignoring initialisation time ChaCha20 is more than twice as fast as HC128 on several of those benchmarks.

@vks
Copy link
Collaborator

vks commented May 3, 2019

Not really; ignoring initialisation time ChaCha20 is more than twice as fast as HC128 on several of those benchmarks.

Did you mean including initialization time?

@burdges
Copy link
Contributor

burdges commented May 3, 2019

IETF ChaCha has an initialization phase, but not ChaCha20. I'm not sure if I'm the ideal person to review this really.

@dhardy
Copy link
Member

dhardy commented May 3, 2019

Who is the ideal person to review this then? You don't have to worry so much about correctness since we have test vectors; it's more about the API, approach and style.

@kazcw
Copy link
Contributor Author

kazcw commented May 3, 2019

@dhardy Although the security margin of ChaCha12 is thought to be solid, AFAIK no one is currently using ChaCha with less than 20 rounds for anything important. My inclination would be to default to ChaCha20, to avoid being on the forefront of trusting anything out-of-the-box.

@vks There's room for more performance improvement: the AVX2 backend of ppv-lite86 is currently disabled, so right now only machine features up through SSE4.1 are in use. Once I get that code updated to the new ppv-lite API, performance should be within spitting distance of hand-optimized assembly for this algo.

@dhardy I can cut down the dependencies. I'll see what I can do.

kazcw added 4 commits May 3, 2019 09:45
Eliminated all transitive deps for rand_chacha except c2-chacha (common
impl for rand and stream-cipher) and ppv-lite86 (SIMD implementation).
@kazcw
Copy link
Contributor Author

kazcw commented May 3, 2019

@dhardy As of 90adb48 I eliminated all (transitive) dependencies except c2-chacha (common impl for rand and stream-cipher) and ppv-lite86 (SIMD library).

kazcw added 3 commits May 3, 2019 11:20
miri can't handle libstd's CPU detection. I presume it can't do the SIMD
intrinsics either but I'll address that in ppv-lite.
I don't know if any user would want this but it's needed for miri
@dhardy
Copy link
Member

dhardy commented May 3, 2019

Thanks @kazcw.

Rand is in the interesting position of trying to provide a "general purpose generator", and because of that we don't necessarily want to make the same compromises as others. I wish we could allow configuration by the end user, but the only option we have for that are feature-flags, which are not ideal for this kind of configuration. (I suppose we could have several flags like stdrng_chacha20 and use the most secure of all enabled feature flags, or the most secure option if no flag is used.)

@burdges
Copy link
Contributor

burdges commented May 4, 2019

Interface looks fine. It's basically the same, no? It's fine using distinct types for round count.

I avoid dependencies on RustCrypto whenever possible myself, but maybe not sensible here. Ideally rust would've implicit feature flags other crates in the same build, so RustCrypto would only become a dependency if it was a dependency anyways.

I have not dug into the c2_chacha code.

@kazcw
Copy link
Contributor Author

kazcw commented May 4, 2019

avx2 has landed in the SIMD backend, so that should make your fancy-schmancy Haswells happy 😄. Benchmark comparison from a GCE instance:

# current master:
test gen_bytes_chacha20 ... bench: 2,591,248 ns/iter (+/- 101,109) = 395 MB/s

# c2-chacha branch, with ppv-lite86 0.2.2:
test gen_bytes_chacha20 ... bench: 1,273,655 ns/iter (+/- 26,588) = 803 MB/s

# c2-chacha branch, with ppv-lite86 0.2.3:
test gen_bytes_chacha20 ... bench: 752,480 ns/iter (+/- 25,401) = 1360 MB/s

@dhardy
Copy link
Member

dhardy commented May 5, 2019

Certianly looks good in the micro-benches 😄

test gen_bytes_chacha12             ... bench:     321,606 ns/iter (+/- 4,552) = 3184 MB/s
test gen_bytes_chacha20             ... bench:     491,814 ns/iter (+/- 17,275) = 2082 MB/s
test gen_bytes_chacha8              ... bench:     237,518 ns/iter (+/- 6,924) = 4311 MB/s

We ought to test non-Intel CPUs too; I guess ChaCha20 is a lot slower on ARMv7?

Anyway, I think this PR is ready now? Thanks for all the contributions @kazcw!

@dhardy
Copy link
Member

dhardy commented May 5, 2019

Dependencies are now down to autocfg (used by several Rand crates), c2-chacha, lazy_static, ppv-lite86 and rand_core, which I think is acceptable.


[build-dependencies]
autocfg = "0.1"

[features]
default = ["std"]
default = ["std", "simd"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a breaking change, because the simd feature does not change the API, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, disabling simd just forces use of the same portable implementation as architectures that don't have SIMD backends.

@dhardy dhardy merged commit 65b8198 into rust-random:master May 7, 2019
This was referenced May 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Would this repo be open to a SIMD implementation of ChaCha20?
4 participants