
Streaming Interface for authenticated encryption? #218

Closed
LoupVaillant opened this issue Jul 25, 2021 · 30 comments

@LoupVaillant
Owner

LoupVaillant commented Jul 25, 2021

Now that Monocypher is just under 1900 lines of code, we have room for more functionality. One that is very tempting to add is streaming AEAD, similar to Libsodium's.

One reason for the temptation is the ability to share code with crypto_lock() and crypto_unlock(). If the streaming API is provided separately, then a bunch of code, most notably authentication, must be duplicated. Inside Monocypher, however, we can define the current AEAD interface in terms of the streaming API. I have written a prototype that currently costs 49 lines of code (plus 6 new functions and one additional struct).
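
To make the code-sharing argument concrete, here is a rough sketch of what the one-shot lock could look like on top of a streaming interface. It assumes the crypto_stream_* API proposed further down this thread (names and signatures are those of the later proposal, not an existing Monocypher function):

void lock_via_stream(uint8_t mac[16], uint8_t *cipher_text,
                     const uint8_t key[32], const uint8_t nonce[24],
                     const uint8_t *ad, size_t ad_size,
                     const uint8_t *plain_text, size_t text_size)
{
    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, nonce);   // one-shot = single-chunk stream
    crypto_stream_write_aead(&ctx, mac, cipher_text, ad, ad_size,
                             plain_text, text_size);
    crypto_wipe(&ctx, sizeof(ctx));         // don't leak the chaining state
}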

On the other hand, we don't want to include something of limited value. We may want other functionality in the future, and it would be a pity if we had to break the 2KLoC psychological barrier to get it. To be included, this streaming interface should satisfy the following criteria:

  • No major footguns.
  • Close enough to ideal for file encryption.
  • Close enough to ideal for interactive sessions over TCP (or other reliable ordered channel).

If possible, it should also satisfy those bonus criteria:

  • RFC 8439 friendly: Implementing with IETF's ChaCha20 (96-bit nonce, 32-bit counter) should be straightforward.
  • DJB friendly: Implementing it with the original ChaCha20 (64-bit nonce, 64-bit counter) should be straightforward.
  • Fast: we should limit unneeded overhead.
  • Compatible with crypto_lock() (mostly for the sake of elegance).
  • Able to handle unreasonably large chunks without triggering nonce reuse. (Though, who uses large chunks with a streaming API?)

One major question remains about the "no major footguns" criterion: key commitment. As they stand, polynomial-hash based AEAD ciphers are vulnerable to attacks where the same ciphertext can be successfully decrypted under several possible keys. Most importantly, an attacker can find those keys and exploit that for nefarious purposes. There's also a related question about message commitment, where it's possible to find several messages that produce the same authentication tag (that one requires knowledge of the key).

We could solve those problems by using a random-key-robust AEAD, such as ChaCha20/BLAKE2b, or by adding a bunch of encrypted zeroes to the message (well, to each chunk, actually). This is either slower or unwieldy, not to mention the increased size overhead.

I believe there is another way: make sure the protocol leaves no room for tricking the recipient into using the wrong decryption key. Commit the key once, then use ordinary AEAD. Also, we need to determine how much of an issue the lack of message commitment may be.

If this works, then a ChaPoly based streaming interface is probably worth adding. If it does not work, then even crypto_lock() was probably not worth adding, and we'd better not compound the problem with yet another broken construction.

Food for thought, to be considered when we can make the time.

@LoupVaillant
Copy link
Owner Author

Okay, I found why people worry about commitment in the context of encrypted messages. Following this paper about partitioning oracles, Age decided to mitigate the attack by limiting the size of its chunks.

Partitioning oracles are about exploiting the fact that some AEAD constructions, including ChaPoly, allow attackers to craft a ciphertext that successfully decrypts under more than one key. Instinctively, you would think that a ciphertext can only decrypt under the key that generated it. Not quite:

  • By the pigeonhole principle, a 16-byte tag only has 2^128 possible values, while we have 2^256 possible keys. On average, a given ciphertext has the right tag for 2^128 possible keys. Though it doesn't matter if one can't find those keys efficiently.
  • In some cases, an attacker can efficiently craft a message that decrypts under several keys. The worst possible case would be if they could craft messages that decrypt under half of all possible keys.

The partitioning oracle attack works thus: first, the attacker crafts a ciphertext that decrypts under half of all possible keys, such that the attacker knows that half. Then they send that message to the victim, who candidly leaks whether decryption was successful or not (through an error message or a timing leak). Now the attacker has gained one bit of information about the key. Repeat the operation 255 more times, and voilà, the attacker has the whole key, and the victim is pwned.

While practical attacks aren't that powerful (against AES-GCM, the paper talks about covering up to 2^16 keys in a single message, and only up to 10 in the case of ChaPoly), they can still be quite scary in two scenarios:

  • Password based encryption. Passwords have relatively little entropy. By ruling out several possible passwords in a single query, we could considerably speed up password recovery.
  • De-anonymization. If we want to test which public key a recipient is using, and we have a list of possible public keys, we could speed up the search by ruling out several keys in a single query.

Age mitigates those attacks by limiting the size of its chunks: with this limitation, it's only possible to cover two keys per query instead of a gazillion.


Now, as much as we could discourage scenarios under which there is a decryption oracle, we're gonna have to assume users may fall into this trap. So as a crypto library, it would be best to only provide committing AEAD. Unfortunately, (i) Monocypher already provides a non-committing ChaPoly construction, and (ii) polynomial hashes are fast. So there seems to be a tension between usability and performance.

That said, there is a way to defeat partitioning oracles: commit the key before actually using it:

  • In the context of interactive key exchange, this means using a key exchange mechanism that unambiguously selects the shared secret. That is, make it so there is only one possible shared secret, and if any party uses another one, the key exchange fails. I think, but don't know for sure, that modern key exchange protocols such as Noise already have this guarantee.

  • In the context of non-interactive encryption, this means checking the validity of the key before trying to decrypt anything. We could, for instance, start the file with a hash of the key or similar, so that we can abort decryption if we don't see the right hash (see the sketch after this list). This is also useful to avoid wasting time on files for which we're trying the wrong key, or when checking multiple possible slots in the case of PURBs. The disadvantage is the possibility of leaking the fact that we failed to open the file header, as opposed to a decryption failure elsewhere. It's not clear to me what interesting attack could be mounted with this information, though.
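
A minimal sketch of that header check, using Monocypher 3.x's BLAKE2b and constant-time comparison (the "key-commit" label and the helper names are arbitrary, not an established format):

#include <monocypher.h>
#include <string.h>

// Derive a public key-check value, with a label for domain separation.
static void key_commitment(uint8_t commit[64], const uint8_t key[32])
{
    uint8_t buf[11 + 32];
    memcpy(buf,      "key-commit", 11);
    memcpy(buf + 11, key,          32);
    crypto_blake2b(commit, buf, sizeof(buf));
    crypto_wipe(buf, sizeof(buf));
}

// Compare the commitment stored in the header against the candidate key.
// Returns 0 on match; abort decryption on anything else.
static int check_key(const uint8_t stored[64], const uint8_t key[32])
{
    uint8_t commit[64];
    key_commitment(commit, key);
    int mismatch = crypto_verify64(stored, commit);
    crypto_wipe(commit, sizeof(commit));
    return mismatch;
}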

So, if we manage to commit the key before starting the secure session or decrypting the actual message, we should be good. No need to limit the size of chunks like Age does. And since this naturally happens in a number of situations, a non-committing AEAD could be viewed as legitimate.

Still, we must recognise that partitioning oracle attacks are a footgun, and we must consider carefully whether that should stop us. In any case, we probably want to add a security considerations section about them to the manual.

@snej
Contributor

snej commented Jan 14, 2022

Encrypted streams would be very good to have.

Sodium's crypto_stream isn't really a stream, however: it's a sequence of messages, and the encrypting and decrypting parties have to agree on what the sizes of the messages are, because the decryption API requires the caller to pass in the entire encrypted message.

This is a pain in the butt for use in network connections, because

  • You can't just divide the stream into fixed-size messages, because in an interactive protocol the sender will send some data and await a response, and if that data ends up in an unfilled buffer in the encryption layer, the connection will deadlock.
  • So messages have to be variable length. But that means framing, i.e. prefixing the encrypted message with its length in bytes. This tends to reveal the message boundaries to eavesdroppers.
  • The workaround that I've seen (e.g. Scuttlebutt's box-stream) is to encrypt the byte-count, then encrypt the message. That's extra work and adds a second MAC to every message.

@LoupVaillant
Owner Author

Hmm, I'm not sure we can meaningfully hide message sizes from traffic analysis in an interactive protocol, so I'm not sure Scuttlebutt's workaround really works: when you're sending a message and there's no other message right behind it, you have to send that first message anyway, and encrypting its length won't do much good if the eavesdropper can just notice a pause in the data stream. If you really want to hide message sizes & boundaries, I have two strategies in mind:

  • Group messages in a single, bigger message. That may mean accumulating messages until you have enough to send.
  • Pad your messages. The overhead will be… significant, though it could be reduced by message grouping (see the sketch after this list).
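
For illustration, one simple way to pad (the 4-byte length prefix and 256-byte bucket size are arbitrary choices, not part of any proposal here):

#include <string.h>
#include <stdint.h>

#define BUCKET 256

// Prepend the real length, then zero-pad to the next multiple of BUCKET.
// Returns the padded size; encrypt that many bytes as one message.
size_t pad_message(uint8_t *padded, const uint8_t *msg, uint32_t size)
{
    size_t total = ((size + 4 + BUCKET - 1) / BUCKET) * BUCKET;
    padded[0] = (uint8_t) size;
    padded[1] = (uint8_t)(size >>  8);
    padded[2] = (uint8_t)(size >> 16);
    padded[3] = (uint8_t)(size >> 24);
    memcpy(padded + 4, msg, size);
    memset(padded + 4 + size, 0, total - 4 - size);
    return total;
}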

How effective those strategies really are is application dependent. And more to the point, I don't think I can come up with a nice, simple, low-overhead streaming scheme that supports any of the above out of the box.

Then there are the use cases where you do know chunk sizes in advance: file encryption and streaming lots of data. In both cases, chunk length can be a hard coded convention, written at the beginning of the file, or agreed upon during the handshake. For these use cases, libsodium's approach minimises overhead.

Personally, I don't think I can find a lightweight approach that satisfies your use case out of the box. I'm afraid users will have to endure butt pain, at least until I get around to writing an actual network library.

@LoupVaillant
Owner Author

Seems like I finally have a design worth considering (latest version). Here's the API:

typedef struct {
    uint32_t counter; // MSB of the ChaCha20 counter
    uint8_t  key[32];
    uint8_t  nonce[8];
} crypto_stream_ctx;

void crypto_stream_init(crypto_stream_ctx *ctx,
                        const uint8_t key[32], const uint8_t nonce[24]);
void crypto_stream_write(crypto_stream_ctx *ctx,
                         uint8_t            mac[16],
                         uint8_t           *cipher_text,
                         const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read(crypto_stream_ctx *ctx,
                       uint8_t           *plain_text,
                       const uint8_t      mac[16],
                       const uint8_t     *cipher_text, size_t text_size);

void crypto_stream_rekey(crypto_stream_ctx *ctx);
void crypto_stream_init_raw(crypto_stream_ctx *ctx,
                            const uint8_t key[32],
                            const uint8_t nonce[8],
                            uint32_t counter);

// with additional data
void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

A few things to consider:

  • The true kernel of this whole interface is crypto_stream_init_raw(), crypto_stream_write_aead(), crypto_stream_read_aead(), and crypto_stream_rekey(). Everything else can be implemented just by calling those functions. This includes the existing non-streaming API, and even RFC 8439.

  • Right now we have crypto_stream_init() and crypto_stream_init_raw(), but realistically the only purpose of the latter is to implement crypto_stream_rfc8439_init() and crypto_stream_djb_init(). Having raw access reduces the surface of the core API, but then users would have to write the following themselves:

void crypto_stream_rfc8439_init(crypto_stream_ctx *ctx,
                                const uint8_t key[32], const uint8_t nonce[12])
{
    crypto_stream_init_raw(ctx, key, nonce + 4, load32_le(nonce));
}

void crypto_stream_djb_init(crypto_stream_ctx *ctx,
                            const uint8_t key[32], const uint8_t nonce[8])
{
    crypto_stream_init_raw(ctx, key, nonce, 0);
}
  • For consistency with the rest of the library, the read() and write() functions come in two flavours: one with additional data, and one without. Consistency is important, but it does cost us 12 lines of code and 2 exported symbols. Probably not a big deal, though still worth noting.

  • Although it is called automatically to avoid counter/nonce overflow (which, with the current design, happens once every 4 billion messages), the rekey() function still needs to be exposed to allow easy forward secrecy (a symmetric ratchet) in long-lived symmetric sessions.

  • When used to implement the current crypto_lock() or a "DJB original" ChaCha20 version of it, the streaming API has no practical limit on message length. However, as soon as we have more than one message in the stream, message length (except for the last message before rekeying) is limited to 512GiB (minus 64 bytes). This is in my opinion an acceptable limit, considering that people use streaming interfaces precisely to avoid such unreasonably gigantic chunks.

  • Since IETF ChaCha20 and DJB ChaCha20 can both be implemented in terms of each other, we can write the same streaming interface with an IETF version of ChaCha20 instead. Doing so would likely limit all message lengths to 256GiB (a 32-bit block counter allows 2^32 blocks of 64 bytes), but apart from that we can achieve perfect interoperability with existing IETF ChaCha20 implementations.

  • Just writing this comment caused me to correct a critical nonce & counter reuse in the rekey step. There may be more errors. Thus, the exact design of the streaming interface absolutely must be formalised and peer reviewed before we even dream of merging it into the master branch.

@fscoto
Contributor

fscoto commented Feb 1, 2022

I haven't actually used the API yet, so take the following with a grain of salt, but I'd like to know these ahead of time going in:

  1. Is the libsodium crypto_secretstream_xchacha20poly1305_* family compatible with the crypto_stream_* family? Considering that a large number of the questions we get are about libsodium compatibility, I would assume this to be a consideration.
  2. Considering this is a high-level offering, are you sure you want a detachable authentication tag? The rest of Monocypher consistently has it detached, but those are lower-level building blocks. My understanding of the purpose of this new interface is, e.g. in network communications, to agree on a secret key and then just have both sides talk to each other with the streaming interface; or, e.g. in age-style file encryption, to just dump the stream output and treat it as opaque. I'm not sure a higher-level construction should detach the MAC, especially considering RFC 8439 § 2.8 formally defines the AEAD output as ciphertext followed by MAC.
  3. Why rfc8439 instead of ietf in the function name? This seems inconsistent with the other ChaCha20 functions.

@snej
Contributor

snej commented Feb 1, 2022

This looks reasonable to me (and similar to the libSodium API that I recently built a byte-oriented stream around.) I assume the behavior of crypto_stream_rekey is such that both the sender and receiver have to call it at the same "time" in the message stream?

@LoupVaillant
Owner Author

@fscoto:

  1. No it's not. I considered it, but I disagree pretty strongly with Libsodium's design choices on the matter: its streaming interface is incompatible with its monolithic interfaces, and requires too much code to be implemented without significant runtime overhead.

  2. Actually, this is quite a low-level offering, to the point where the monolithic AEAD is implemented in terms of it. I couldn't do that without a detached tag. Another reason is consistency with the existing API. Yet another reason is buffer lengths: I'd rather have the size of the plaintext and ciphertext be the same; it makes it easier to know how long each buffer must be. (If I could go back in time, maybe I would reattach the tag everywhere, though to be honest I'm not sure.)

  3. Ah, you're right, ietf is better. Blame the naming scheme I (mistakenly) chose for the test suite. Note however that this alternate init function will probably not be included: not only would it require more code, consistency would almost demand that we implement the corresponding monolithic versions as well, meaning 8 functions. Way too much code for too little benefit.


@snej:

Yes, both parties need to call rekey() at the same time. There's no tag for the recipient to detect & trigger a rekey with; it must be decided manually by the application writer (either by convention, or by communicating the rekey order in-band).

@LoupVaillant
Owner Author

Added some tests, and refined the design. I have managed to lift the size limits of chunks, at the cost of an increased rekey overhead (the automatic rekey occurs every 256 chunks instead of every 4 billion).

The API maintains an internal context with the following information:

  • A 32-byte key.
  • An 8-byte nonce.
  • A 32-bit counter.

There are 3 ways to initialise this context that make some sense (a sketch in C follows the list):

  • From a key and 24-byte nonce (the default):
    • ctx.key = HChacha20(key, nonce[0:16[)
    • ctx.nonce = nonce[16:24[
    • ctx.counter = 0
  • From a key and 8-byte nonce ("DJB original"):
    • ctx.key = key
    • ctx.nonce = nonce
    • ctx.counter = 0
  • From a key and 12-byte nonce (RFC 8439)
    • ctx.key = key
    • ctx.nonce = nonce[4:12[
    • ctx.counter = little_endian(nonce[0:4[)
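
As a rough illustration, assuming the crypto_stream_ctx layout shown earlier (counter, key, nonce) and Monocypher 3.x's crypto_hchacha20(); the function names below are placeholders, not the final API:

#include <monocypher.h>
#include <string.h>

static void stream_init_x(crypto_stream_ctx *ctx,      // 24-byte nonce (default)
                          const uint8_t key[32], const uint8_t nonce[24])
{
    crypto_hchacha20(ctx->key, key, nonce);             // uses nonce[0:16[
    memcpy(ctx->nonce, nonce + 16, 8);                  // nonce[16:24[
    ctx->counter = 0;
}

static void stream_init_djb(crypto_stream_ctx *ctx,     // 8-byte nonce
                            const uint8_t key[32], const uint8_t nonce[8])
{
    memcpy(ctx->key,   key,   32);
    memcpy(ctx->nonce, nonce, 8);
    ctx->counter = 0;
}

static void stream_init_ietf(crypto_stream_ctx *ctx,    // 12-byte nonce (RFC 8439)
                             const uint8_t key[32], const uint8_t nonce[12])
{
    memcpy(ctx->key,   key,       32);
    memcpy(ctx->nonce, nonce + 4, 8);                   // nonce[4:12[
    ctx->counter = (uint32_t)nonce[0]                   // little_endian(nonce[0:4[)
                 | (uint32_t)nonce[1] <<  8
                 | (uint32_t)nonce[2] << 16
                 | (uint32_t)nonce[3] << 24;
}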

(I have realised that the 12 byte nonce cannot be used for the incremental interface because of catastrophic nonce reuse problems with chunks beyond the first one, as well as rekeying. Because of this, the current design only provides the short and extended nonce versions, which do not have this problem. I removed the raw_init() function, which is impossible to document. That being said, full compliance with RFC 8439 is still possible for messages that use only one chunk. It requires poking at the internals of the context, but this beats copying & editing the entire AEAD code.)

Nonce sizes have their usual pros & cons: short nonces have no overhead, but they cannot be random. Long nonces can be random, but they incur an additional HChaCha20 call. Once the context is initialised, we can start sending message chunks. Each chunk is built thus:

  • counter = ctx.counter << 32
  • ciphertext, tag = AEAD(ctx.key, ctx.nonce, counter)
  • ctx.counter = ctx.counter + 2^24
  • If ctx.counter == 0, rekey()

The AEAD works the same as RFC 8439, except with a 64-bit counter and a 64-bit nonce, and the start value of the counter is not always zero. The counter is then incremented, and if it cycles back to zero we rekey.

With the chosen increment, we end up rekeying every 256 messages. This means one additional HChaCha20 call per 256 messages. In the absolute worst case (empty messages), the overhead is less than 0.4%. In a reasonable worst case (short messages), the overhead halves to 0.2%. In a typical streaming scenario with 32KiB chunks, even if we assume perfect vector instruction parallelism, we get a 0.006% overhead, which I deem negligible. The reason for this big increment is that it ensures message chunk lengths are unlimited (the theoretical limit is 2^62 bytes, which is close enough to infinity).

Finally, the rekey operation derives the next key from the current one (sketched in code after the list):

  • ctx.key = HChaCha20(ctx.key, 0xffffffffffffffff || ctx.nonce)
  • ctx.counter does not change.
  • ctx.nonce does not change.
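
In code, again assuming the context layout above and Monocypher 3.x's crypto_hchacha20() (the function name is a placeholder):

static void stream_rekey(crypto_stream_ctx *ctx)
{
    uint8_t in[16];
    memset(in, 0xff, 8);              // 0xffffffffffffffff
    memcpy(in + 8, ctx->nonce, 8);    // || ctx.nonce
    crypto_hchacha20(ctx->key, ctx->key, in);
    crypto_wipe(in, sizeof(in));
    // counter and nonce are left unchanged
}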

With this design, I believe we have a versatile, limitless, low-overhead streaming API, that as a bonus even allows RFC 8439 compliance if we can tolerate a tiny little hack (I currently use that hack in the tests to verify RFC compliance). Though I'll still seek peer review, I'm fairly confident the current design is sound. The only thing left to do now is discuss tradeoffs.

@LoupVaillant
Owner Author

One small change to the rekeying algorithm: we should reset the counter to zero.

This changes nothing for automatic rekeying, because the counter is already zero. However, manual rekeying can happen at any time, and when it happens, we can safely delay automatic rekeying to reduce overhead.

@LoupVaillant
Owner Author

I had an epiphany. 😎

Rekeying can be achieved by copying 32 previously unused bytes. See, RFC 8439 requires one ChaCha20 call to generate the authentication key. Strangely enough, it does not use HChaCha20, but regular ChaCha20. This incurs a small overhead, but Monocypher conforms to it because it provides cheap compatibility with Libsodium. Thing is, a ChaCha20 block is 64 bytes, and the authentication key only uses the first 32. The last 32 can be used however we want.

This is ideal for rekeying. And since the overhead is so low (just copy 32 bytes), we can afford to do it for every single message, which by the way makes the rekey() function redundant. The result is very cheap, automatic forward secrecy with a symmetric ratchet.

Even better, since we change the key on a systematic basis, we no longer need to mess with the counter. As a consequence, we now can use 96-bit nonces with multi-chunk messages without fearing any collision, so I can justify core support for those (and along them RFC 8439 compatibility).

Icing on the cake, users now have a (somewhat low-performance) fast key erasure user-space CSPRNG at their disposal. When they need new random bytes, they can just write a new message from a stream context, and the current seed will automatically be erased. With this poor man's RNG, the temptation to add a dedicated one plummets.

This design is also very close to full blown key commitment support: for single chunk messages, we can use the key as an additional tag instead of a key for the next chunk, and have the recipient compare that tag along with the authentication tag. I'm not sure this is worth the trouble however, because of the increased message size overhead.

And on top of it the whole thing is simpler. I saved 8 lines of code.


The API didn't change much: I've removed the now redundant rekey() function, and added crypto_stream_init_ietf():

void crypto_stream_init(crypto_stream_ctx *ctx,
                        const uint8_t key[32], const uint8_t nonce[24]);
void crypto_stream_init_djb(crypto_stream_ctx *ctx,
                            const uint8_t key[32], const uint8_t nonce[8]);
void crypto_stream_init_ietf(crypto_stream_ctx *ctx,
                             const uint8_t key[32], const uint8_t nonce[12]);

void crypto_stream_write(crypto_stream_ctx *ctx,
                         uint8_t            mac[16],
                         uint8_t           *cipher_text,
                         const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read(crypto_stream_ctx *ctx,
                       uint8_t           *plain_text,
                       const uint8_t      mac[16],
                       const uint8_t     *cipher_text, size_t text_size);

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Internally, initialisation is unchanged, but the way chunks are generated is simplified:

ciphertext, tag = AEAD(ctx.key, ctx.nonce, ctx.counter)
ctx.key         = free_bytes_from_AEAD

That simple. If I zoom in a little and detail the AEAD we get this:

auth_key, new_key = ChaCha20(ctx.key, ctx.nonce, ctx.counter, zeroes, 64)
ciphertext        = ChaCha20(ctx.key, ctx.nonce, ctx.counter + 1, plaintext, message_size)
tag               = AUTH(auth_key, ad, ad_size, ciphertext, text_size);
ctx.key           = new_key

The AUTH() step is directly taken from RFC 8439. The new_key is basically free, because a ChaCha block is 64 bytes anyway.
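
For illustration, here is roughly what one chunk looks like with Monocypher 3.x primitives (DJB-style 64-bit nonce variant; the AUTH() step is simplified to a bare Poly1305 over the ciphertext, without the AD and length padding RFC 8439 mandates, and the function name is a placeholder):

#include <monocypher.h>
#include <string.h>

static void stream_write_sketch(uint8_t key[32], const uint8_t nonce[8],
                                uint64_t counter, uint8_t mac[16],
                                uint8_t *cipher_text,
                                const uint8_t *plain_text, size_t text_size)
{
    uint8_t block[64] = {0};
    crypto_chacha20_ctr(block, block, 64, key, nonce, counter);   // auth_key || new_key
    crypto_chacha20_ctr(cipher_text, plain_text, text_size,
                        key, nonce, counter + 1);                 // encrypt the chunk
    crypto_poly1305(mac, cipher_text, text_size, block);          // tag (simplified AUTH)
    memcpy(key, block + 32, 32);                                  // ratchet: ctx.key = new_key
    crypto_wipe(block, sizeof(block));
}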

@fscoto
Contributor

fscoto commented Feb 5, 2022

So essentially, you're doing away with the rekey method because every stream chunk is an immediate re-key anyway at the cost of an extra 32-byte copy for each message. Do you have your RPi on hand? I wonder how bad the extra copy expenses end up being on a smaller platform.

Going back to try and rewrite fk to use the new API, I notice that I'm missing libsodium's crypto_secretstream tag handling. Namely, I'm missing a way to signal end-of-stream. The current API works well enough for a network-based API where the connection just gets dropped or there's a protocol-level "we are going to stop communicating" message. However, you cannot detect a premature end-of-stream with the current API, i.e. the size of the entire stream can only be effectively authenticated with user-level code rather than library code, and that involves non-trivial extra effort in the form of branches and relatively fragile additional code that is easy to forget to test.

Was this part of your threat model?

@LoupVaillant
Owner Author

So essentially, you're doing away with the rekey method because every stream chunk is an immediate re-key anyway at the cost of an extra 32-byte copy for each message.

Yup.

Do you have your RPi on hand? I wonder how bad the extra copy expenses end up being on a smaller platform.

I don't have an R-Pi, but we do have the compiler explorer. Here's what the extra COPY() loop turns into under -O2 with the latest GCC:

        add     lr, sp, #72
        add     ip, sp, #40
        ldmia   lr!, {r0, r1, r2, r3}
        str     r0, [r7, #8]      @ unaligned
        str     r1, [r7, #12]     @ unaligned
        str     r2, [r7, #16]     @ unaligned
        str     r3, [r7, #20]     @ unaligned
        ldmia   lr!, {r0, r1, r2, r3}
        str     r0, [r7, #24]     @ unaligned
        str     r1, [r7, #28]     @ unaligned
        str     r2, [r7, #32]     @ unaligned
        str     r3, [r7, #36]     @ unaligned

-O3 is basically identical (it maybe saves one add), but -Os does keep a loop:

       add     r6, sp, #64
[...]
        mov     r3, r5
.L71:
        mov     r2, r6
        adds    r3, r3, #8
        ldmia   r2!, {r0, r1}
        str     r0, [r3, #-8]     @ unaligned
        str     r1, [r3, #-4]     @ unaligned
        mov     r6, r2
        cmp     r2, r4
        bne     .L71

In both cases we're copying word by word, which should be quite a bit faster than going byte by byte. Overall, I don't expect the slowdown will be noticeable, except perhaps for empty messages (and even then we still process one Chacha20 block and one Poly1305 block for the authentication tag).

Was this part of your threat model?

Oops, it was not, to be honest. This tag looked like an unnecessary complication, so I just scrapped it without much thought. That said, I think we have an ace up our sleeve: the additional data. We just need to add something to it at the last block. If the stream cuts off too soon (or too late!), the recipient will have the wrong additional data for its last block, and decryption will automatically fail.

#include <monocypher.h>

#define BLOCK_SIZE (32 << 10)
uint8_t END_TAG[4] = "LAST";

void encrypt(uint8_t *out, size_t *out_size,
             const uint8_t key[32], const uint8_t nonce[24],
             const uint8_t *in, size_t in_size)
{
    *out_size = 0;
    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, nonce);
    uint64_t nb_blocks = (in_size + BLOCK_SIZE - 1) / BLOCK_SIZE; // assumes in_size > 0
    for (uint64_t i = 0; i < nb_blocks - 1; i++) {
        crypto_stream_write(&ctx, out, out+16, in, BLOCK_SIZE);
        in        += BLOCK_SIZE;
        in_size   -= BLOCK_SIZE;
        out       += BLOCK_SIZE + 16;  // advance past mac + ciphertext
        *out_size += BLOCK_SIZE + 16;
    }
    crypto_stream_write_aead(&ctx, out, out+16, END_TAG, 4, in, in_size);
    *out_size += in_size + 16;
}

int decrypt(uint8_t *out, size_t *out_size,
            const uint8_t key[32], const uint8_t nonce[24],
            const uint8_t *in, size_t in_size)
{
    *out_size = 0;
    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, nonce);
    uint64_t nb_blocks = (in_size + BLOCK_SIZE + 15) / (BLOCK_SIZE + 16); // assumes in_size >= 16
    for (uint64_t i = 0; i < nb_blocks - 1; i++) {
        if (crypto_stream_read(&ctx, out, in, in+16, BLOCK_SIZE)) {
            return -1;
        }
        in        += BLOCK_SIZE + 16;
        in_size   -= BLOCK_SIZE + 16;
        out       += BLOCK_SIZE;       // advance past plaintext
        *out_size += BLOCK_SIZE;
    }
    *out_size += in_size - 16;         // ignore on failure
    return crypto_stream_read_aead(&ctx, out, in, END_TAG, 4, in+16, in_size-16);
}

There, no additional branch, and I believe very little additional code overall. How would that translate to your file encryption utility?

@fscoto
Contributor

fscoto commented Feb 5, 2022

Doesn't this handling of the end tag imply that you know and can transmit the size of the entire stream ahead of time so that you could just authenticate the entire message length as AD in the first block and just error out then? I'm more concerned about scenarios where the full size of the streamed data isn't known ahead of time, such as when JSON or encoded ASN.1 output is streamed out on the fly.

Re fk in particular: I wanted to avoid having to determine the file size ahead of time and just stream the chunks in, since fseeko() isn't part of the C standard library and ftell() only returns long, breaking large files. I guess it ends up working, just not with as much standards purity as I hoped for.

@LoupVaillant
Copy link
Owner Author

OK, in this case there is no escaping the overhead associated with transmitting that tag across the wire. Libsodium, by the way, does it quite inefficiently, reserving an entire additional ChaCha20 block for the tag, then authenticating all 64 bytes (16 would have sufficed; the only reason this is not too slow is vector instructions). I'm not sure there's an ideal solution to be honest. Right now I see three approaches (the first is sketched in code after the list):

  • Transmit the tag unencrypted, but make it part of the additional data. No PURB for you, but we're only leaking sizes, so it has valid use cases.
  • If we need to make the tag look random, we can split the encryption key in two, and encrypt tags with the secondary key (and still authenticate it as additional data). We can even reduce encryption overhead by batching the encryption of tags by blocks of 64.
  • We could just transmit the information in-band, interleaved with the actual data. But unless the way you access the plaintext already involves a copy into a buffer of your own choosing, this will mean an extra copy, and this time it won't be just 32 bytes per chunk.
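
A minimal sketch of the first approach (the TAG_* values and the frame layout, tag then MAC then ciphertext, are arbitrary examples, not a proposed wire format):

enum { TAG_MESSAGE = 0, TAG_FINAL = 1 };

// Send a 1-byte chunk-type tag in the clear, but authenticate it as AD.
static void write_chunk(crypto_stream_ctx *ctx, uint8_t *frame,
                        uint8_t tag, const uint8_t *plain_text, size_t text_size)
{
    frame[0] = tag;                                    // plaintext tag
    crypto_stream_write_aead(ctx, frame + 1, frame + 17,
                             frame, 1,                 // the tag is the AD
                             plain_text, text_size);   // frame = tag || mac || cipher
}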

With the interface I'm proposing, I believe all three approaches are possible. They're not trivial, but if we chose & provided one out of the box, we'd paint ourselves into a corner just like Libsodium, which has etched its overhead & extra complexity into the wire format itself. I'd like to avoid that. Some use cases will likely benefit from one approach more than another, and others won't even need the overhead. I want to allow them all if possible.

That being said, I'll need to write actual code to get a better feel for this API. One core design goal after all is for it to be suitable for file formats & network protocols, so this has to be tested. And though I'm currently fairly confident about the performance, simplicity, and versatility of my design, further tests may yet send me back to the drawing board.

@LoupVaillant
Owner Author

Yet another approach just occurred to me:

  • The stream ends when we get a chunk smaller than the maximum agreed upon chunk size. And if the data is an exact multiple of the chunk size, just send an additional empty chunk. The recipient then just needs to check the size of the last chunk, and make sure the transmission (or the file) does not end with a full-sized one (see the sketch below).
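
A file-oriented sketch of that convention, using the proposed crypto_stream_* API (MAX_CHUNK and the error handling are arbitrary; for files, fread() only returns short at end of file or on error, which is what makes this work):

#include <stdio.h>
#include <monocypher.h>

#define MAX_CHUNK (32 << 10)

static int decrypt_stream(FILE *in, FILE *out, crypto_stream_ctx *ctx)
{
    uint8_t cipher[MAX_CHUNK + 16];   // mac || ciphertext
    uint8_t plain [MAX_CHUNK];
    size_t  n;
    do {
        n = fread(cipher, 1, sizeof(cipher), in);
        if (n < 16)                                                      { return -1; } // truncated
        if (crypto_stream_read(ctx, plain, cipher, cipher + 16, n - 16)) { return -1; } // corrupt
        if (fwrite(plain, 1, n - 16, out) != n - 16)                     { return -1; } // I/O error
    } while (n == sizeof(cipher));    // a short (possibly empty) chunk ends the stream
    return ferror(in) ? -1 : 0;
}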

@ewd340

ewd340 commented May 5, 2022

First of all, thank you for your work on Monocypher. I like how clean it is.

I've read with interest this thread and wanted to chime in to tell you that I came across Monocrypt, a WIP by @skeeto, that uses Monocypher to encrypt/decrypt files using the high level crypto_lock and crypto_unlock primitives of Monocypher.

This tool doesn't need to know the file's size a priori. So I thought that it may give you some inspiration for this streaming interface.

Again, thank you for your work, and sorry if my comment is off topic.

@LoupVaillant
Owner Author

Thanks for the heads up @ewd340, I’ll take a look.

@ewd340

ewd340 commented May 26, 2022

I finally got some time to play with the streaming interface (from the stream branch), and wrote a very basic utility (heavily inspired by Monocrypt). Now, I have to admit, I'm not a cryptographer, but this exercise was fun.
I have tried it (not heavily tested) on a small 'hello world' text file and on a 360 MB video file, and it seems to work.

#define CHUNKLEN ((128 << 10) - 16)
#define VERSION 0
uint8_t END_TAG[4] = "LAST";

enum error {ERR_OK, ERR_KEYFILE, ERR_READ, ERR_WRITE, ERR_TRUNC, ERR_INVALID};

I have defined a small header, which actually could have been just a nonce:

struct aead_header {
    uint8_t nonce[24];
    uint8_t version;
};

The encryption function:

static enum error fencrypt(FILE *in, FILE *out, const uint8_t key[32]){
    int eof;
    int len;
    uint8_t *ad;
    size_t adlen;
    enum error err = ERR_OK;

    struct aead_header header;
    uint8_t buf_in[CHUNKLEN];
    uint8_t buf_out[CHUNKLEN+16];

    getrandom(header.nonce, 24, GRND_NONBLOCK);
    header.version = VERSION;

    crypto_stream_ctx ctx;
    crypto_stream_init(&ctx, key, header.nonce);

    if (!fwrite(&header, sizeof(header), 1, out)) {
        return ERR_WRITE;
    }

    do{
        len = fread(buf_in, 1, CHUNKLEN, in);
        if (!len && ferror(in)) {
            err = ERR_READ;
            break;
        }
        eof = feof(in);
        ad = (eof)?END_TAG:0;
        adlen = (eof)?4:0;
        crypto_stream_write_aead(&ctx, buf_out, buf_out+16, ad, adlen, 
                                 buf_in, len);

        if (!fwrite(buf_out, 1, (size_t) len+16, out)) {
            err = ERR_WRITE;
            break;
        }  
    } while(!eof);

    return err;
}

And the decryption one:

static enum error fdecrypt(FILE *in, FILE *out, const uint8_t key[32])
{
    int eof;
    int len;
    uint8_t *ad;
    size_t adlen;
    enum error err = ERR_OK;

    struct aead_header header;
    uint8_t buf_out[CHUNKLEN];
    uint8_t buf_in[CHUNKLEN+16];

    crypto_stream_ctx ctx;

    /* Read the header to get some info about the encryption: nonce, ver... */
    if (!fread(&header, 1, sizeof(header), in)) {
        return ferror(in) ? ERR_READ : ERR_TRUNC;
    }

    crypto_stream_init(&ctx, key, header.nonce);

    do {
        len = fread(buf_in, 1, CHUNKLEN+16, in);

        if (!len && ferror(in)) {
            err = ERR_READ;
            break;
        }

        eof = feof(in);
        ad = (eof)?END_TAG:0;
        adlen = (eof)?4:0;

        if(crypto_stream_read_aead(&ctx, buf_out, buf_in, ad, adlen, 
                                   buf_in+16, len-16)) {
            err = ERR_INVALID;
            break;
        }

        if(!fwrite(buf_out, (size_t) len-16, 1, out)) {
            err = ERR_WRITE;
            break;
        }
    } while(!eof);

    return err;
}

Again, this was just a small POC, please let me know if I'm missing something.

@snej
Contributor

snej commented May 27, 2022

For what it's worth, my network-stream code is in my secret-handshake-cpp repo.

There are two layers: CryptoBox works with chunks of encrypted data, reading/writing an entire chunk; and CryptoStream is a byte-oriented stream interface that reads and writes CryptoBoxes.

This has not been heavily used yet, but it works well enough to support the CapnProto RPC protocol on a TCP socket.

@LoupVaillant
Owner Author

@ewd340, though it may have been intentional for readability, you may have missed one thing: getrandom(2) is a low-level system call that is allowed to fail. Worse, it may not give you as many random bytes as you want even if it succeeds. I would recommend arc4random_buf(3) instead if it is available. If not, maybe implement it yourself with getrandom(2) under the hood, such that the only possible outcomes are success or a clean crash.

The way you use additional data to signal the end of the stream and avoid truncation attacks looks correct. It's also simple and cheap, I like it.

@LoupVaillant
Owner Author

@snej, thanks for the reference, I'll take a look, see if what I'm proposing here could have been used to simplify your code.

@ewd340

ewd340 commented May 28, 2022

Thank you so much for the feedback, @LoupVaillant!
I didn't honestly pay enough attention to the fact that the getrandom(2) primitive may fail or return fewer bytes than requested for buffers larger than 255 bytes. Yet, yes, about its return value, the man page clearly states:

The careful programmer will check for this anyway!

Would something like this do the trick?

static enum error random_buf(uint8_t *buf, size_t buflen)
{
    enum error err = ERR_OK;
    if (buflen > 255) {
        err = ERR_NO_RANDOM;
    } else {
       ssize_t len = getrandom(buf, buflen, 0);
       err = (len ==  (ssize_t)buflen)?ERR_OK:ERR_NO_RANDOM;
    }
    return err;
}

@ewd340

ewd340 commented May 28, 2022

Speaking of this interface, I got confused a little bit by the parameter order of the functions:

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            uint8_t           *plain_text,
                            const uint8_t      mac[16],
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Not a big deal, it is easy to read indeed, but I somewhat expected both functions to use the same order (especially for the mac).

@fscoto
Contributor

fscoto commented May 28, 2022

@ewd340 If I may... I have some input on getrandom and its relatives.

First of all: The cutoff is at 256 bytes, not 255 bytes. For everything else, read on.

getrandom(2) started as a Linux-specific syscall, but there are vaguely compatible interfaces in other systems, namely at least FreeBSD, Solaris and illumos (but not OpenBSD or macOS, more on those later). There is no standard to adhere to.

The 256-byte limitation is arbitrary (though likely there to avoid spending too much time in the kernel per syscall), and a larger buffer may only get partially filled. Since there is no real distinction between a syscall and a “normal” C standard library function, an operating system is free to implement a semantically identical userspace getrandom(3) instead, too. I therefore don't think you should guard against the size limit pre-emptively. Just let the call fail and go from there. I would instead write approximately this (snippet hereby placed under the CC0-1.0):

static enum error random_buf(uint8_t *buf, size_t buflen)
{
    size_t to_read;
    ssize_t len;

    while (buflen != 0) {
        to_read = buflen;
        len = getrandom(buf, to_read, 0);

        if (len == -1) {
            switch (errno) {
            case EAGAIN:
            case EINTR:
                continue;       /* interrupted: try again */
            default:
                return ERR_NO_RANDOM;
            }
        }

        /* A short read (possible for requests above 256 bytes) just loops. */
        buf    += len;
        buflen -= (size_t)len;
    }
    return ERR_OK;
}

(Since there is no standard, any implementation may also fail when fewer bytes are requested, but since most platforms will add it for compatibility with Linux, there is a natural incentive not to lower the maximum number of bytes returned.)

On a similar note: getrandom(2) is somewhat new, introduced in Linux 3.17 (2014); perhaps some embedded platform's vendor kernel doesn't have it. glibc added support in 2.25 (2017). If you have a new glibc but an old kernel, glibc may return ENOSYS, which is a permanent, non-user error, even though it may look like getrandom(2) exists on a naive check. Consider ensuring a fallback at runtime or a thorough compile-time functionality check in your build system if possible.


Miscellaneous portability notes born from frustration:

OpenBSD and macOS have not implemented getrandom(2). They use getentropy(2). The interface is intended to seed (cryptographically secure) RNGs in userspace with kernel randomness; e.g. OpenBSD's arc4random(3) seeds itself with it. Applications are expected to call arc4random(3) directly instead. While almost all of the platforms mentioned thus far implement arc4random(3), there is one major platform that does not: Linux/glibc.

POSIX will, in a future issue, standardize getentropy() only. There is no help coming by way of the standards committees of the world.

A portable solution would, therefore, have to check in this order and at least partially at build time:

  • if you have arc4random(3), all is well (except on older versions of the following: macOS < 10.12, where failing to open /dev/random for seeding causes a bad seed, which is why LibreSSL doesn't trust it, and FreeBSD < 12.0, which used RC4 with its attendant statistical bias, but LibreSSL is okay with that);
  • if you have getrandom(2), check for failure;
  • if you have /dev/urandom, check for a lot of failures;
  • if you have getentropy(2), check for failure and... do something with the result to approximate seeding an RNG and immediately using its output I guess;
  • if you reach this point, process termination is your only reasonable option.

About Windows: Windows has a crypto API and it expects you to use it; getting random bytes involves setting up a cryptography context and then calling a function called BCryptGenRandom that uses it to generate random bytes. (The function also has "BCrypt" in its name, yet has nothing to do with the bcrypt hash algorithm.)

This isn't strictly necessary, however. Some people work around this limitation by calling into the deprecated RtlGenRandom() function, which doesn't require keeping state. This leads to hacks such as the one found in libsodium.
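
To make the checklist above concrete, here is the rough shape such a fallback chain could take on non-Windows platforms (the HAVE_* macros are hypothetical build-system checks, and the error handling is deliberately minimal):

#include <stddef.h>
#include <stdint.h>

#if defined(HAVE_ARC4RANDOM)
#include <stdlib.h>
static int fill_random(uint8_t *buf, size_t len)
{
    arc4random_buf(buf, len);                 /* cannot fail by design */
    return 0;
}
#elif defined(HAVE_GETRANDOM)
#include <sys/types.h>
#include <sys/random.h>
#include <errno.h>
static int fill_random(uint8_t *buf, size_t len)
{
    while (len != 0) {
        ssize_t n = getrandom(buf, len, 0);
        if (n == -1) {
            if (errno == EINTR || errno == EAGAIN) { continue; }
            return -1;                        /* includes ENOSYS on old kernels */
        }
        buf += n;                             /* short reads are possible       */
        len -= (size_t)n;
    }
    return 0;
}
#else  /* /dev/urandom fallback: check for a lot of failures */
#include <stdio.h>
static int fill_random(uint8_t *buf, size_t len)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL) { return -1; }
    size_t n = fread(buf, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}
#endif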

@ewd340

ewd340 commented May 28, 2022

Thanks a lot @fscoto for your very instructive input, I liked every bit of it! Thanks also for the better random_buf function.

Re: portable solutions, I appreciate the details you provided, I wish every OS would just provide arc4random, but that's not the case, and I guess we'll have to deal with this reality. I, in fact, thought to myself that since this random_buf will be used (in my case) only for generating a nonce, maybe I could use a PRNG that would be seeded with ---pardon my naivety--- the current time?

@LoupVaillant
Owner Author

LoupVaillant commented May 29, 2022

maybe I could use a PRNG that would be seeded with ---pardon my naivety--- the current time?

Goodness, no! (Edit: Oops, my mistake, I misunderstood the question.)

Current time is way too predictable. There's not enough entropy to be extracted from it. Even if you get down to the millisecond, an attacker who can guess when you performed the seeding down to the second will only have 10 bits of entropy to search through. And I'm being reasonable; the best attacks may be even better.

If you have an OS that can give you randomness, you should use it. If the system API is too horrible, you may initialise a user space RNG when the program starts up, and after that no more errors. User space RNGs are error prone though, multithreading and all that, so be careful.

And sometimes, you don’t even get randomness. Perhaps not even jitter entropy if you’re using a single process in-order embedded CPU. In that case, you need to come up with 32 random bytes from another source (like your desktop), record that seed in your chip, and seed your RNG with that. And of course, make sure you renew the seed every time you reboot, so you don’t end up reusing the same "random" stream over and over.

@ewd340

ewd340 commented May 29, 2022

Thanks again @LoupVaillant, and sorry if I'm diverting the discussion from its main topic with my randomness issues. I enjoy the very informative input I'm getting from you guys.

The" Goodness, no!" made me chuckle... I saw it coming :) Still, I thought that, since the nonce need not be unpredictable (only unique per key, if I have understood correctly), I can get away with something basic. Anyways... even for a nonce, it won't hurt to have good entropy... so my laziness is not justified, I admit.

Back to the interface, again not a big deal, but I humbly think that the mac argument in crypto_stream_read_aead should be in the same position as in crypto_stream_write_aead. That is, something like this:

void crypto_stream_write_aead(crypto_stream_ctx *ctx,
                              uint8_t            mac[16],
                              uint8_t           *cipher_text,
                              const uint8_t     *ad        , size_t ad_size,
                              const uint8_t     *plain_text, size_t text_size);
int crypto_stream_read_aead(crypto_stream_ctx *ctx,
                            const uint8_t      mac[16],
                            uint8_t           *plain_text,
                            const uint8_t     *ad        , size_t ad_size,
                            const uint8_t     *cipher_text, size_t text_size);

Or is there a reasoning behind the proposed order that I don't get?

@LoupVaillant
Owner Author

LoupVaillant commented May 29, 2022

Still, I thought that, since the nonce need not be unpredictable (only unique per key, if I have understood correctly), I can get away with something basic.

Oh, sorry, I misunderstood. I shouldn't have answered when I was this tired. Yes, a nonce doesn't have to be unpredictable, only unique. If you have a reliable clock (one that never displays the same time twice), you can use it. There's one big caveat though: make sure you don't run two encryptions at the same time.

Alternatively, you can maintain a counter, but the same reuse/misuse issues that apply to userspace RNGs will apply to that counter.
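
For instance, a persistent 64-bit counter can be turned into the 8-byte nonce directly (the helper below is purely illustrative; persisting and incrementing the counter before every encryption is the caller's job):

#include <stdint.h>

static void counter_nonce(uint8_t nonce[8], uint64_t counter)
{
    for (int i = 0; i < 8; i++) {
        nonce[i] = (uint8_t)(counter >> (8 * i));   // little-endian encoding
    }
}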

Or is there a reasoning behind the proposed order that I don't get?

There is: in all of Monocypher, the arguments go in the following order:

  • context
  • output arguments
  • input arguments

And yes, sometimes this particular consistency sacrifices other kinds of consistency. Your proposal makes sense, it's just not the logic I chose to follow. Also, note the order of arguments for the current AEAD API:

void crypto_lock_aead(uint8_t        mac[16],
                      uint8_t       *cipher_text,
                      const uint8_t  key[32],
                      const uint8_t  nonce[24],
                      const uint8_t *ad        , size_t ad_size,
                      const uint8_t *plain_text, size_t text_size);
int crypto_unlock_aead(uint8_t       *plain_text,
                       const uint8_t  key[32],
                       const uint8_t  nonce[24],
                       const uint8_t  mac[16],
                       const uint8_t *ad         , size_t ad_size,
                       const uint8_t *cipher_text, size_t text_size);

Same logic, similar order. (We have a key and nonce instead of the context, so they don't come first, but the rest of the arguments are in the same order.)

(Edit: Alternatively, I could swap mac and ciphertext in the read() function and still preserve my logic, but it still wouldn't be consistent with the existing functions.)

@ewd340

ewd340 commented May 29, 2022

There is: in all of Monocypher, the arguments go in the following order:

* context

* output arguments

* input arguments

This is it! I kind of felt it, but failed to see it. Thanks again.

LoupVaillant added a commit that referenced this issue Jan 10, 2023
Tests not complete, no documentation yet.

I'm currently rethinking the AEAD API as a whole, and to be honest I'm
so happy with this streaming API that I believe it could replace the
regular API entirely.

One problem with the AEAD API is the sheer number of arguments.
`crypto_lock_aead()` and `crypto_unlock_aead()` currently have 8
arguments, comprising 6 pointers (all of the same type) and 2 sizes.
There are way too many opportunities to swap arguments and break stuff.

The streaming API however is divided into an init phase, which has only
3 arguments, and a read/write phase, which has 7, but "only" 4 pointers
to byte buffers.  Which I don't think we can improve much.  We could try
and use a second struct similar to what we do with Argon2, but with only
7 arguments (compared to Argon2's 15) I don't think we would gain that
much readability.

As for how to use the streaming API for single shot uses, that's obvious
enough:

- Declare the context and call Init.
- Call read/write.
- Wipe the context.

One may argue that everything else (Poly1305, Blake2b, SHA-512, and
HMAC) provide a one-shot API, and we should do so here as well.  There's
just one problem: we don't have one init function, we have _three_.

If we provide a one-shot API, orthogonality would need all 3 variants.
That's 6 functions total (3 locks, 3 unlocks), which is a bit much,
especially since at least one of them is only provided for compatibility
with a standard I don't entirely agree with.  We could of course only
provide only the single one-shot API (like we do right now), but that
leaves such an obvious hole in the API.

Stopping at just the 5 functions we need for everything (streaming,
one-shot, all 3 variants) is very tempting.

See #218
@LoupVaillant
Owner Author

The streaming interface has been added, and the design gelled into what I believe to be a satisfactory shape.

There is one thing I must explain that's not in the commit message: the choice to not imitate libsodium. There are two reasons:

  1. With my design, the direct interface is implemented in terms of the streaming interface. All fully compatible, and users can have full RFC compliance on top. It also requires less code, makes everything simpler… the works.
  2. Libsodium provides an additional tag that I don't. One reason I don't is the systematic overhead incurred by their tag (each chunk is 1 byte bigger, and in Libsodium's case requires processing an additional ChaCha20 block). This tag is not needed in all cases. Very often it can be transmitted implicitly by knowing the size of the input in advance, or simply by closing the connection (and changing the additional data for the last block). In cases where that's not possible (can't close the connection, can't know the size in advance…), users can implement this tag on top of this interface.

Oh, and about this key commitment thing and partitioning oracle attacks: that should be addressed by committing the key at the beginning of the stream, typically by outputting a hash derived from the same seed as the encryption key.

With this, I believe we are done.
