One-pass encryption/decryption #74
Related issue: generalized AEAD implementations based on stream ciphers (RustCrypto/traits#45)
As a complete crypto noob who got here from GitHub Explore (but not a complete Rust noob), would this be feasible? (Or even just a small part, like trying to make decryption for chacha20poly1305 single-pass?) How robust are the existing tests? Would the change mostly be turning implementations like https://github.com/RustCrypto/AEADs/blob/master/chacha20poly1305/src/cipher.rs#L66-L91 into something more like https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/lib.rs#L317-L347 ?
Yes, but it also needs to be done in a way that actually improves performance. I've tried to make this change naively a few times (to […]). I think doing it properly might require keeping the data flowing through XMM registers; at the very least it needs to all stay in L1 cache. ChaChaPoly is even trickier because […]: this issue is a micro-optimization, and so far we don't have an AVX2 backend for Poly1305 (see RustCrypto/universal-hashes#49).
Thanks for the response! I think I'll give it a shot, but don't let that stop anyone else from trying it, since I don't have much hope I'll be able to do much. How important is performance when compiling with AVX support vs. without? (I.e., would you mostly care about speed when compiling with SIMD extensions, or does the "default" cargo build configuration also matter a lot?)
The […]
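(For context on the question above, and not part of the lost reply: SIMD target features can be enabled at compile time with standard rustc flags, e.g.:)

```sh
RUSTFLAGS="-C target-feature=+avx2" cargo build --release
```

Without such flags, the crates are limited to whatever the default target supports, plus runtime CPU feature detection where it is implemented.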
Let's take `chacha20poly1305` as an example. In the AEAD crate itself:

```rust
impl<C> Cipher<C>
where
    C: StreamCipher + StreamCipherSeek,
{
    pub(crate) fn encrypt_in_place_detached(
        mut self,
        associated_data: &[u8],
        buffer: &mut [u8],
    ) -> Result<Tag, Error> {
        // ...
        // Not currently implemented, but imagine we did this:
        for chunk in buffer.chunks_mut(BLOCK_SIZE * 4) {
            self.cipher.apply_keystream(chunk);
            self.mac.update_padded(chunk);
        }
        // ...
    }
}
```
Then on the stream cipher side, in the `chacha20` crate:

```rust
pub(crate) const BUFFER_SIZE: usize = BLOCK_SIZE * 4;

impl<R: Rounds, MC: MaxCounter> StreamCipher for ChaCha<R, MC> {
    fn try_apply_keystream(&mut self, mut data: &mut [u8]) -> Result<(), LoopError> {
        // ...
        let mut chunks = data.chunks_exact_mut(BUFFER_SIZE);
        for chunk in &mut chunks {
            let counter_with_offset = self.counter_offset.checked_add(counter).unwrap();
            self.block.apply_keystream(counter_with_offset, chunk);
            counter = counter.checked_add(COUNTER_INCR).unwrap();
        }
        // ...
    }
}
```
...which bottoms out in the SIMD backend:

```rust
const BLOCKS: usize = 4;

impl<R: Rounds> Core<R> {
    pub fn apply_keystream(&self, counter: u64, output: &mut [u8]) {
        debug_assert_eq!(output.len(), BUFFER_SIZE);
        unsafe {
            let state = State {
                a: self.v0,
                b: self.v1,
                c: self.v2,
                d: iv_setup(self.iv, counter),
            };
            let state = self.rounds(state);
            for i in 0..BLOCKS {
                for (chunk, a) in output[i * BLOCK_SIZE..(i + 1) * BLOCK_SIZE]
                    .chunks_mut(0x10)
                    .zip(
                        [state.a, state.b, state.c, state.d]
                            .iter()
                            .map(|s| s.blocks[i]),
                    )
                {
                    // XOR 16 bytes of keystream into the data using SSE2
                    // intrinsics from core::arch::x86_64
                    let b = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
                    let out = _mm_xor_si128(a, b);
                    _mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, out);
                }
            }
        }
    }
}
```
On the MAC side, the `universal-hash` trait:

```rust
pub trait UniversalHash {
    fn update_padded(&mut self, data: &[u8]) {
        let mut chunks = data.chunks_exact(Self::BlockSize::to_usize());
        for chunk in &mut chunks {
            self.update(GenericArray::from_slice(chunk));
        }
        // ...
    }
}
```
...and the `poly1305` AVX2 backend:

```rust
impl State {
    pub(crate) unsafe fn compute_block(&mut self, block: &Block, partial: bool) {
        // ...
        self.cached_blocks[self.num_cached_blocks].copy_from_slice(block);
        if self.num_cached_blocks < 3 {
            self.num_cached_blocks += 1;
            return;
        } else {
            self.num_cached_blocks = 0;
        }
        let p = Aligned4x130::from_blocks(&self.cached_blocks);
        // ...
    }
}

impl Aligned4x130 {
    pub(super) unsafe fn from_blocks(src: &[Block; 4]) -> Self {
        // 26-bit mask on each 32-bit word.
        let mask_26 = _mm256_set1_epi32(0x3ffffff);
        // Sets bit 24 of each 32-bit word.
        let set_hibit = _mm256_set1_epi32(1 << 24);
        // - Load the four blocks into the following 32-bit word layout:
        //     [b33, b32, b31, b30, b23, b22, b21, b20]
        //     [b13, b12, b11, b10, b03, b02, b01, b00]
        //
        // - Unpack the upper and lower 64 bits:
        //     [b33, b32, b13, b12, b23, b22, b03, b02]
        //     [b31, b30, b11, b10, b21, b20, b01, b00]
        //
        // - Swap the middle two 64-bit words:
        //     a0 = [b33, b32, b23, b22, b13, b12, b03, b02]
        //     a1 = [b31, b30, b21, b20, b11, b10, b01, b00]
        let (lo, hi) = src.split_at(2);
        let blocks_23 = _mm256_loadu_si256(hi.as_ptr() as *const _);
        let blocks_01 = _mm256_loadu_si256(lo.as_ptr() as *const _);
        // ...
    }
}
```
So, the hot path above runs through `encrypt_in_place_detached` → `try_apply_keystream` → `Core::apply_keystream` on the cipher side, and `update_padded` → `compute_block` → `Aligned4x130::from_blocks` on the MAC side.
So the immediate blocker is that the […]. Once that is addressed, […].
Yeah, it's definitely a drawback that the […]. These sorts of optimization problems for passing data between stream ciphers and universal hash functions were the impetus for the […]. I abandoned that, but now I wonder if maybe […]
The fundamental issue here is runtime detection. It not only means that the optimal number of blocks processed in parallel can change depending on CPU capabilities (and in some cases even on CPU family!), but also that when combining primitives we need a way to automatically generate a matrix of possible capability combinations. If algorithm 1 can process 3 blocks by default and 8 blocks with feature A, and algorithm 2 can process 2 blocks by default and 6 with feature B, then ideally when combining them we should generate four code paths (the combined chunk size in each case being the least common multiple of the two block counts): by default processing 6 blocks, for feature A processing 8 blocks, for feature B processing 6 blocks, and for features A and B processing 24 blocks. And if the algorithms have different block sizes, the problem becomes even harder.

Rust does not have good tools for solving this problem and likely will not have them anytime soon. At the very least we would need some kind of function multi-versioning (i.e. an ability to define different function implementations for different target features) with an ability to query available versions at compile time. And ideally we would need trait multi-versioning as well, since it's preferable to store the chunk size as an associated constant, but allowing public API (via associated constants and types) to change depending on available target features is a sizable can of worms with potentially non-trivial implications. Defining those combinations manually could work to some extent, but it will be hard to maintain, and I don't think the compiler will be able to optimize out our […]. Round-tripping […].

I hope to alleviate some of these issues in the new version of the traits. It introduces slice-based block-level traits for hashes/MACs/universal hashes, hides the chunk size from the public API, and instead uses callback-based methods. Not only should this help with inlining, but it also effectively inverts control over iteration. In other words, iteration over blocks is controlled not by the higher-level code which combines the primitives, but at the cipher level. That means we can branch once per loop instead of once per chunk (the compiler is currently unable to do this optimization automatically). It also means that the callbacks (which are used for passing blocks to the MAC) execute in a context with the target features enabled and the chunk size known. Unfortunately this approach is still far from ideal. Roughly, it results in the following code:

```rust
if is_aesni_available() {
    for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
        aesni_encrypt(chunk);
        if is_pclmul_available() {
            pclmul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
} else {
    for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
        default_encrypt(chunk);
        if is_pclmul_available() {
            pclmul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
}
```

In other words, if the cipher backend does not cover the features required by the MAC backend, we still have an optimization barrier on our hands.
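(For illustration, and not part of the comment above: the obvious fix for that per-chunk MAC branch is to hoist the feature check out of the loop, reusing the same hypothetical helpers as the pseudocode. That removes the branch but requires one loop body per capability combination, which is exactly the matrix-of-combinations problem described earlier:)

```rust
// Sketch: hoisting the MAC feature check out of the hot loop. The per-chunk
// branch disappears, but the number of loop bodies multiplies with every
// independent CPU feature involved.
if is_aesni_available() {
    if is_pclmul_available() {
        for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
            aesni_encrypt(chunk);
            pclmul_mac(chunk);
        }
    } else {
        for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
            aesni_encrypt(chunk);
            default_mac(chunk);
        }
    }
} else if is_pclmul_available() {
    for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
        default_encrypt(chunk);
        pclmul_mac(chunk);
    }
} else {
    for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
        default_encrypt(chunk);
        default_mac(chunk);
    }
}
```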
@newpavlov have you actually tested that the optimizations you expect actually work out in practice, especially considering things like traits defined in two crates being consumed by a third, where the first crate is using […]?

To set a baseline for maximum performance, I think we could move things like the CPU feature tests into the AEAD crates like […]. Once we're reasonably certain of what a performant implementation looks like, we can experiment with various abstractions, although I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me. I feel like there are slice-based abstractions missing from […].

Glossing over a few things, in practice I think the optimal block sizes look like the following:

- AES-GCM / AES-GCM-SIV: […]
- ChaCha20Poly1305: […]
No, I only played a bit with small snippets in godbolt. We may need to abuse […].
I am myself far from 100% happy with the result, but right now I don't see a better path forward, and compared to the current design I think it's definitely an improvement. Could you please comment in the PR on the elements which you don't like or don't fully understand? I would appreciate your feedback sooner rather than later, since I hope to finalize it in the near future.
I agree; this is why the callbacks in my PR are only done on the cipher side, while MACs and universal hashes are left with the slice-based methods. But we are still left with the problem of target-feature branching inside the chunk iteration. Even if we are to check redundant features such as CLMUL in […].

Also, do not forget that code with enabled target features currently cannot be inlined at all, so we should definitely strive to keep chunk processing inside a context with the same target features.
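(For readers unfamiliar with the inlining restriction being referenced: this is a general property of `#[target_feature]` in Rust at the time, not anything specific to these crates. A function compiled with extra target features cannot be inlined into a caller compiled without them, so the call itself becomes an optimization barrier. A minimal sketch, with hypothetical function names:)

```rust
#[target_feature(enable = "avx2")]
unsafe fn mac_blocks_avx2(blocks: &mut [u8]) {
    // ... AVX2 implementation ...
}

fn outer(blocks: &mut [u8]) {
    if is_x86_feature_detected!("avx2") {
        // This call cannot be inlined into `outer`, because `outer` is
        // compiled without the avx2 feature. Everything that should fuse
        // with the AVX2 code must live inside `mac_blocks_avx2` itself.
        unsafe { mac_blocks_avx2(blocks) }
    }
}
```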
Left a comment on the PR. Just generally, I'm confused about what is happening there.
That's why I was suggesting exposing low-level, architecture-specific APIs to optimize passing data between ciphers and UHFs. Then the check can be performed once, at the level of the entire AEAD, at the time the AEAD is initialized, and branched upon at the granularity of large AEAD operations. The fast path for the entire core can occur within […].
Such an API would have to be […]. I guess it could be a practical stop-gap solution and a baseline for comparing generic solutions against.
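(A minimal sketch of the "check once at initialization, branch per operation" idea discussed in the last two comments. All names here are hypothetical, not an actual RustCrypto API:)

```rust
// Which backend to use is decided exactly once, when the AEAD is constructed.
enum Backend {
    /// AES-NI + CLMUL detected at runtime.
    Ni,
    /// Portable fallback.
    Soft,
}

struct OnePassAead {
    backend: Backend,
    // key material, etc.
}

impl OnePassAead {
    fn new(/* key */) -> Self {
        let backend = if is_x86_feature_detected!("aes")
            && is_x86_feature_detected!("pclmulqdq")
        {
            Backend::Ni
        } else {
            Backend::Soft
        };
        Self { backend }
    }

    fn encrypt_in_place(&self, buffer: &mut [u8]) {
        // One branch per AEAD operation, not per chunk: the entire fused
        // encrypt-and-MAC loop lives inside the selected backend.
        match self.backend {
            Backend::Ni => unsafe { encrypt_in_place_ni(buffer) },
            Backend::Soft => encrypt_in_place_soft(buffer),
        }
    }
}

#[target_feature(enable = "aes,pclmulqdq")]
unsafe fn encrypt_in_place_ni(_buffer: &mut [u8]) { /* fused AES-NI/CLMUL loop */ }

fn encrypt_in_place_soft(_buffer: &mut [u8]) { /* portable loop */ }
```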
After numerous experiments, I think I've found what a good solution to this problem could look like, but unfortunately it's blocked on the lack of rank-2 polymorphism in Rust. I wrote about it here: https://internals.rust-lang.org/t/15875

So I think the callback-based solution explored in the […]
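(To illustrate what rank-2 polymorphism means in this context; this is my gloss on the comment above, not taken from the linked post, and the names are hypothetical:)

```rust
// Hypothetical, NOT valid Rust: the callback must work for any chunk size `N`
// chosen by the backend at runtime, i.e. `f` itself must be generic over `N`
// (rank-2 polymorphism):
//
// fn apply_keystream_with<F>(&mut self, f: F)
// where
//     F: for<const N: usize> FnMut(&mut [Block; N]); // no such bound exists
//
// With ordinary (rank-1) generics, the caller fixes one `N` up front, so the
// cipher cannot switch chunk sizes behind the callback without erasing the
// size down to a plain slice.
```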
Currently all of the AEAD implementations do two passes over the plaintext/ciphertext when encrypting/decrypting: for encryption, they encrypt the plaintext in the first pass and authenticate the resulting ciphertext in the second pass; for decryption, it's vice versa.
A better approach is to pick a number of blocks to operate on in parallel and encrypt/authenticate (or authenticate/decrypt) in a single pass. This has better cache locality: when we encrypt data, store the resulting ciphertext, and then immediately load it again to do authentication, that load is pretty much guaranteed to hit L1 cache if both steps happen within a single pass (and ideally we could hand off values still stored in e.g. SIMD registers).
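(A minimal sketch of the difference, reusing the `cipher`/`mac`/`BLOCK_SIZE` names from the `chacha20poly1305` excerpt earlier in the thread:)

```rust
// Two passes (status quo): the whole buffer is written out by the cipher,
// then re-read from memory by the MAC.
cipher.apply_keystream(buffer);
mac.update_padded(buffer);

// One pass (this issue): each small chunk is encrypted and then immediately
// authenticated while it is still hot in L1 cache (or, ideally, registers).
for chunk in buffer.chunks_mut(BLOCK_SIZE * 4) {
    cipher.apply_keystream(chunk);
    mac.update_padded(chunk);
}
```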
This is a tracking issue for converting the implementations of these respective algorithms to be one-pass. It also might be good to discuss ways we could have a generic implementation of one-pass encryption/decryption in the `aead` crate (especially one specialized for the non-SIV `stream-cipher` + `universal-hash` use case) which can be reused across different algorithm implementations.

- aes-gcm
- aes-gcm-siv †
- aes-siv †
- chacha20poly1305
- xsalsa20poly1305

† NOTE: SIV modes by definition cannot support 1-pass encryption (because the first pass generates the synthetic IV, which must be known in advance before encryption can be performed). However, they can support 1-pass decryption, since the IV is known in advance in that case.